RE: [RFC][PATCH 0/2] x86/boot/KASLR: Restrict kernel to be randomized in mirror regions if existed
Dear Baoquan,

> > Our customer reported that Kernel text may be located on non-mirror
> > region (movable zone) when both address range mirroring feature and
> > KASLR are enabled.

I know your customer :)

> > The functions of address range mirroring feature are as follows.
> > - The physical memory region whose descriptors in EFI memory map have
> >   EFI_MEMORY_MORE_RELIABLE attribute (bit: 16) are mirrored
> > - The function arranges such mirror region into normal zone and other
> >   region into movable zone in order to locate kernel code and data on
> >   mirror region
> >
> > So we need restrict kernel to be located inside mirror region if it is
> > existed.
> >
> > The method is very simple. If efi is enabled, just iterate all efi
> > memory map and pick up mirror region to process for adding candidate
> > of slot. If efi disabled or no mirror region existed, still process
> > e820 memory map. This won't bring much efficiency loss, at worst we
> > just go through all efi memory maps and found no mirror.
> >
> > One question:
> > From code, though mirror regions are existed, they are meaningful only
> > if kernelcore=mirror kernel option is specified. Not sure if my
> > understanding is correct.

Your understanding is almost correct.
The procedure above works only when "kernelcore=mirror" is specified.
However, if mirrored regions exist, the bootmem allocator tries to
allocate from them regardless of the "kernelcore=mirror" option.
Kernel text is important, so IMHO placing it in a mirrored (more
reliable) region is reasonable whether or not "kernelcore=mirror" is
specified.

Anyway, thanks for submitting the patch.
We have an Address Range Mirroring capable machine, so we'll test your
patch.

Sincerely,
Taku Izumi

>
> Since you are the author of kernelcore=mirror related code and expert on
> mirror feature, could you help answer above question?
>
> Thanks
> Baoquan
>
> > NOTE:
> > I haven't got a machine with efi mirror region enabled, so only test
> > the e820 map processing case and the case of no mirror region on efi
> > machine. So set this as a RFC patchset, will post formal one after
> > above question is made clear and mirror issue test passed.
> >
> > Baoquan He (2):
> >   x86/boot/KASLR: Adapt process_e820_entry for all kinds of memory map
> >   x86/boot/KASLR: Restrict kernel to be randomized in mirror regions if
> >     existed
> >
> >  arch/x86/boot/compressed/kaslr.c | 129 +++
> >  1 file changed, 104 insertions(+), 25 deletions(-)
> >
> > --
> > 2.5.5
> >
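
The selection rule described in the quoted cover letter (prefer mirrored regions, otherwise fall back to every region) can be sketched in user-space C as below. This is only an illustration under stated assumptions: `struct mem_region` and `pick_candidates()` are hypothetical names, not the kernel's `efi_memory_desc_t` or anything from `kaslr.c`, and the "use all regions" branch merely stands in for the e820 fallback. The only detail taken from the thread is that `EFI_MEMORY_MORE_RELIABLE` is attribute bit 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Attribute bit 16 of an EFI memory descriptor, as quoted in the thread. */
#define EFI_MEMORY_MORE_RELIABLE (1ULL << 16)

/* Hypothetical, simplified descriptor; not the kernel's efi_memory_desc_t. */
struct mem_region {
	uint64_t start;
	uint64_t size;
	uint64_t attribute;
};

/*
 * Copy the regions eligible for kernel placement into out[].
 * If at least one region is mirrored, only mirrored regions are eligible;
 * otherwise every region is (a stand-in for falling back to e820).
 * Returns the number of regions written, at most out_cap.
 */
size_t pick_candidates(const struct mem_region *map, size_t n,
		       struct mem_region *out, size_t out_cap)
{
	size_t i, count = 0;
	int have_mirror = 0;

	assert(map != NULL && out != NULL);

	/* First pass: is there any mirrored region at all? */
	for (i = 0; i < n; i++)
		if (map[i].attribute & EFI_MEMORY_MORE_RELIABLE)
			have_mirror = 1;

	/* Second pass: collect candidates according to the rule above. */
	for (i = 0; i < n && count < out_cap; i++) {
		if (have_mirror &&
		    !(map[i].attribute & EFI_MEMORY_MORE_RELIABLE))
			continue;
		out[count++] = map[i];
	}
	return count;
}
```

The two passes reflect the cost mentioned in the cover letter: in the worst case the whole map is walked once without finding a mirrored region.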
RE: [bug discuss] fjes driver call trace warning, "PNP0C02" used in fjes seems like a bug,
Dear Gao,

> From a SW perspective it like an acpi driver that uses "PNP0C02"
> as driver ids to perform the driver match in the ACPI table.
>
> From my understanding this is wrong in principle because that identifier
> must be used to reserve motherboard resources (see par 4.1.2 of the PCI
> Firmware Specifications v3.2)
>
> Therefore such identifier it is used from
> http://lxr.free-electrons.com/source/drivers/pnp/system.c
> to reserve such resources.
>
> Basically your driver is breaking any other device that
> needs to reserve motherboard resources through system.c
> driver.
>
> @David Miller, what is your opinion about this?
> I think this driver should be reverted...

I'm willing to revise my driver if something is wrong with it.
I can't reproduce this problem.
Could you please show me how to reproduce it?

Sincerely,
Taku Izumi
RE: [PATCH] net: fjes: fjes_main: Remove create_workqueue
Dear Bhaktipriya,

Thanks. Looks good to me.

Sincerely,
Taku Izumi

> -----Original Message-----
> From: Bhaktipriya Shridhar [mailto:bhaktipriy...@gmail.com]
> Sent: Thursday, June 02, 2016 6:31 PM
> To: David S. Miller; Izumi, Taku/泉 拓; Florian Westphal; Bhaktipriya Shridhar
> Cc: Tejun Heo; net...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH] net: fjes: fjes_main: Remove create_workqueue
>
> alloc_workqueue replaces deprecated create_workqueue().
>
> The workqueue adapter->txrx_wq has workitem
> &adapter->raise_intr_rxdata_task per adapter. Extended Socket Network
> Device is shared memory based, so someone's transmission denotes other's
> reception. raise_intr_rxdata_task raises interruption of receivers from
> the sender in order to notify receivers.
>
> The workqueue adapter->control_wq has workitem
> &adapter->interrupt_watch_task per adapter. interrupt_watch_task is used
> to prevent delay of interrupts.
>
> Dedicated workqueues have been used in both cases since the workitems
> on the workqueues are involved in normal device operation and require
> forward progress under memory pressure.
>
> max_active has been set to 0 since there is no need for throttling
> the number of active work items.
>
> Since network devices may be used for memory reclaim,
> WQ_MEM_RECLAIM has been set to guarantee forward progress.
>
> Signed-off-by: Bhaktipriya Shridhar <bhaktipriy...@gmail.com>
> ---
>  drivers/net/fjes/fjes_main.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/fjes/fjes_main.c b/drivers/net/fjes/fjes_main.c
> index 86c331b..9006877 100644
> --- a/drivers/net/fjes/fjes_main.c
> +++ b/drivers/net/fjes/fjes_main.c
> @@ -1187,8 +1187,9 @@ static int fjes_probe(struct platform_device *plat_dev)
>  	adapter->force_reset = false;
>  	adapter->open_guard = false;
>
> -	adapter->txrx_wq = create_workqueue(DRV_NAME "/txrx");
> -	adapter->control_wq = create_workqueue(DRV_NAME "/control");
> +	adapter->txrx_wq = alloc_workqueue(DRV_NAME "/txrx", WQ_MEM_RECLAIM, 0);
> +	adapter->control_wq = alloc_workqueue(DRV_NAME "/control",
> +					      WQ_MEM_RECLAIM, 0);
>
>  	INIT_WORK(&adapter->tx_stall_task, fjes_tx_stall_task);
>  	INIT_WORK(&adapter->raise_intr_rxdata_task,
> --
> 2.1.4
>
RE: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
Dear Xishi,

Sorry for the late reply.

> -----Original Message-----
> From: Xishi Qiu [mailto:qiuxi...@huawei.com]
> Sent: Friday, December 11, 2015 6:44 PM
> To: Izumi, Taku/泉 拓
> Cc: Luck, Tony; linux-kernel@vger.kernel.org; linux...@kvack.org;
> a...@linux-foundation.org; Kamezawa, Hiroyuki/亀澤 寛之;
> m...@csn.ul.ie; Hansen, Dave; m...@codeblueprint.co.uk
> Subject: Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
>
> On 2015/12/11 13:53, Izumi, Taku wrote:
>
> > Dear Xishi,
> >
> >> Hi Taku,
> >>
> >> Whether it is possible that we rewrite the fallback function in buddy
> >> system when zone_movable and mirrored_kernelcore are both enabled?
> >
> > What does "when zone_movable and mirrored_kernelcore are both enabled?"
> > mean ?
> >
> > My patchset just provides a new way to create ZONE_MOVABLE.
> >
>
> Hi Taku,
>
> I mean when zone_movable is from kernelcore=mirror, not kernelcore=nn[KMG].

I'm not quite sure what you mean, but if you want to separate user
memory so that some of it is allocated from a mirrored zone and the
rest from a non-mirrored zone, I think my patchset can be reused for
that.

Sincerely,
Taku Izumi

> Thanks,
> Xishi Qiu
>
> > Sincerely,
> > Taku Izumi
> >>
> >> It seems something like that we add a new zone but the name is
> >> zone_movable, not zone_mirror. And the prerequisite is that we won't
> >> enable these two features(movable memory and mirrored memory) at the
> >> same time. Thus we can reuse the code of movable zone.
> >>
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> >
> > .
> >
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
Dear Xishi,

> Hi Taku,
>
> Whether it is possible that we rewrite the fallback function in buddy system
> when zone_movable and mirrored_kernelcore are both enabled?

What does "when zone_movable and mirrored_kernelcore are both enabled?" mean?

My patchset just provides a new way to create ZONE_MOVABLE.

Sincerely,
Taku Izumi

>
> It seems something like that we add a new zone but the name is zone_movable,
> not zone_mirror. And the prerequisite is that we won't enable these two
> features(movable memory and mirrored memory) at the same time. Thus we can
> reuse the code of movable zone.
>
RE: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
Dear Tony, Xishi,

> >> How about add some comment, if mirrored memory is too small, then the
> >> normal zone is small, so it may be oom.
> >> The mirrored memory is at least 1/64 of whole memory, because struct
> >> pages usually take 64 bytes per page.
> >
> > 1/64th is the absolute lower bound (for the page structures as you say). I
> > expect people will need to configure 10% or more to run any real workloads.
> >
> > I made the memblock boot time allocator fall back to non-mirrored memory
> > if mirrored memory ran out. What happens in the run time allocator if the
> > non-movable zones run out of pages? Will we allocate kernel pages from
> > movable memory?
> >
>
> As I know, the kernel pages will not allocated from movable zone.

Yes, kernel pages are not allocated from ZONE_MOVABLE.
In this case the administrator must review and reconfigure the mirror
ratio via the "MirrorRequest" EFI variable.

Sincerely,
Taku Izumi
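
The 1/64 lower bound quoted above follows from simple arithmetic: one 64-byte page structure per page. A minimal sketch of that calculation is shown below; `min_mirrored_bytes()` is a hypothetical helper written for this note, and the 4096-byte page size used in the usage note is an assumption (the thread states only the 64-byte figure and the 1/64 ratio).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Lower bound on the mirrored memory needed just to hold the page
 * structures, assuming one structure of struct_page_size bytes per page.
 * This is a hypothetical helper, not a kernel function.
 */
uint64_t min_mirrored_bytes(uint64_t total_bytes, uint64_t page_size,
			    uint64_t struct_page_size)
{
	assert(page_size > 0);
	return (total_bytes / page_size) * struct_page_size;
}
```

With 4096-byte pages and 64-byte page structures, a 64 GiB machine needs at least 1 GiB of mirrored memory for the page structures alone, which is the 1/64 ratio. As Tony notes in the quoted text, real workloads are expected to need considerably more.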
RE: [PATCH v2 2/2] mm: Introduce kernelcore=reliable option
Dear Xishi,

Thanks for reviewing.

> -----Original Message-----
> From: Xishi Qiu [mailto:qiuxi...@huawei.com]
> Sent: Wednesday, December 09, 2015 11:26 AM
> To: Izumi, Taku/泉 拓
> Cc: linux-kernel@vger.kernel.org; linux...@kvack.org; tony.l...@intel.com;
> Kamezawa, Hiroyuki/亀澤 寛之; m...@csn.ul.ie;
> a...@linux-foundation.org; dave.han...@intel.com; m...@codeblueprint.co.uk
> Subject: Re: [PATCH v2 2/2] mm: Introduce kernelcore=reliable option
>
> On 2015/11/27 23:04, Taku Izumi wrote:
>
> > This patch extends existing "kernelcore" option and
> > introduces kernelcore=reliable option. By specifying
> > "reliable" instead of specifying the amount of memory,
> > non-reliable region will be arranged into ZONE_MOVABLE.
> >
> > v1 -> v2:
> >  - Refine so that the following case also can be
> >    handled properly:
> >
> >  Node X: |MM--MM|
> >    (legend) M: mirrored -: not mirrored
> >
> >  In this case, ZONE_NORMAL and ZONE_MOVABLE are
> >  arranged like below:
> >
> >  Node X: |--|
> >          |ooxxxxxxoo| ZONE_NORMAL
> >          |ooxx| ZONE_MOVABLE
> >    (legend) o: present x: absent
> >
> > Signed-off-by: Taku Izumi
> > ---
> >  Documentation/kernel-parameters.txt |   9 ++-
> >  mm/page_alloc.c                     | 110 ++--
> >  2 files changed, 112 insertions(+), 7 deletions(-)
> >
> > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > index f8aae63..ed44c2c8 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -1695,7 +1695,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >
> >  	keepinitrd	[HW,ARM]
> >
> > -	kernelcore=nn[KMG]	[KNL,X86,IA-64,PPC] This parameter
> > +	kernelcore=	Format: nn[KMG] | "reliable"
> > +			[KNL,X86,IA-64,PPC] This parameter
> >  			specifies the amount of memory usable by the kernel
> >  			for non-movable allocations. The requested amount is
> >  			spread evenly throughout all nodes in the system. The
> > @@ -1711,6 +1712,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >  			use the HighMem zone if it exists, and the Normal
> >  			zone if it does not.
> >
> > +			Instead of specifying the amount of memory (nn[KMS]),
> > +			you can specify "reliable" option. In case "reliable"
> > +			option is specified, reliable memory is used for
> > +			non-movable allocations and remaining memory is used
> > +			for Movable pages.
> > +
> >  	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
> >  			Format: <Controller#>[,poll interval]
> >  			The controller # is the number of the ehci usb debug
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index acb0b4e..006a3d8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -251,6 +251,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
> >  static unsigned long __initdata required_kernelcore;
> >  static unsigned long __initdata required_movablecore;
> >  static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
> > +static bool reliable_kernelcore;
> >
> >  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
> >  int movable_zone;
> > @@ -4472,6 +4473,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >  	unsigned long pfn;
> >  	struct zone *z;
> >  	unsigned long nr_initialised = 0;
> > +	struct memblock_region *r = NULL, *tmp;
> >
> >  	if (highest_memmap_pfn < end_pfn - 1)
> >  		highest_memmap_pfn = end_pfn - 1;
> > @@ -4491,6 +4493,38 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >  			if (!update_defer_init(pgdat, pfn, end_pfn,
> >  						&nr_initialised))
> >  				break;
> > +
> > +			/*
> > +			 * if not reliable_kernelcore and ZONE_MOVABLE exists,
> > +			 * range from zone_mova
[PATCH v3 2/2] mm: Introduce kernelcore=mirror option
This patch extends existing "kernelcore" option and
introduces kernelcore=mirror option. By specifying
"mirror" instead of specifying the amount of memory,
non-mirrored (non-reliable) region will be arranged
into ZONE_MOVABLE.

v1 -> v2:
 - Refine so that the following case also can be
   handled properly:

 Node X: |MM--MM|
   (legend) M: mirrored -: not mirrored

 In this case, ZONE_NORMAL and ZONE_MOVABLE are
 arranged like below:

 Node X: |MM--MM|
         |ooxxoo| ZONE_NORMAL
         |ooxx| ZONE_MOVABLE
   (legend) o: present x: absent

v2 -> v3:
 - change the option name from kernelcore=reliable
   into kernelcore=mirror
 - documentation fix so that users can understand
   nn[KMS] and mirror are exclusive

Signed-off-by: Taku Izumi
---
 Documentation/kernel-parameters.txt |  11 +++-
 mm/page_alloc.c                     | 110 ++--
 2 files changed, 114 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f8aae63..b0ffc76 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1695,7 +1695,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.

 	keepinitrd	[HW,ARM]

-	kernelcore=nn[KMG]	[KNL,X86,IA-64,PPC] This parameter
+	kernelcore=	Format: nn[KMG] | "mirror"
+			[KNL,X86,IA-64,PPC] This parameter
 			specifies the amount of memory usable by the kernel
 			for non-movable allocations. The requested amount is
 			spread evenly throughout all nodes in the system. The
@@ -1711,6 +1712,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			use the HighMem zone if it exists, and the Normal
 			zone if it does not.

+			Instead of specifying the amount of memory (nn[KMS]),
+			you can specify "mirror" option. In case "mirror"
+			option is specified, mirrored (reliable) memory is used
+			for non-movable allocations and remaining memory is used
+			for Movable pages. nn[KMS] and "mirror" are exclusive,
+			so you can NOT specify nn[KMG] and "mirror" at the same
+			time.
+
 	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
 			Format: <Controller#>[,poll interval]
 			The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index acb0b4e..4157476 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -251,6 +251,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool mirrored_kernelcore;

 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -4472,6 +4473,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	unsigned long pfn;
 	struct zone *z;
 	unsigned long nr_initialised = 0;
+	struct memblock_region *r = NULL, *tmp;

 	if (highest_memmap_pfn < end_pfn - 1)
 		highest_memmap_pfn = end_pfn - 1;
@@ -4491,6 +4493,38 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			if (!update_defer_init(pgdat, pfn, end_pfn,
 						&nr_initialised))
 				break;
+
+			/*
+			 * if not mirrored_kernelcore and ZONE_MOVABLE exists,
+			 * range from zone_movable_pfn[nid] to end of each node
+			 * should be ZONE_MOVABLE not ZONE_NORMAL. skip it.
+			 */
+			if (!mirrored_kernelcore && zone_movable_pfn[nid])
+				if (zone == ZONE_NORMAL &&
+				    pfn >= zone_movable_pfn[nid])
+					continue;
+
+			/*
+			 * check given memblock attribute by firmware which
+			 * can affect kernel memory layout.
+			 * if zone==ZONE_MOVABLE but memory is mirrored,
+			 * it's an overlapped memmap init. skip it.
+			 */
+			if (mirrored_kernelcore && zone == ZONE_MOVABLE) {
+				if (!r ||
[PATCH v3 1/2] mm: Calculate zone_start_pfn at zone_spanned_pages_in_node()
Currently each zone's zone_start_pfn is calculated at
free_area_init_core(). However, a zone's range is already fixed at the
time zone_spanned_pages_in_node() is invoked.

This patch changes the code so that each zone->zone_start_pfn is
calculated at zone_spanned_pages_in_node().

Signed-off-by: Taku Izumi
---
 mm/page_alloc.c | 30 +++---
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17a3c66..acb0b4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4928,31 +4928,31 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long node_start_pfn,
 					unsigned long node_end_pfn,
+					unsigned long *zone_start_pfn,
+					unsigned long *zone_end_pfn,
 					unsigned long *ignored)
 {
-	unsigned long zone_start_pfn, zone_end_pfn;
-
 	/* When hotadd a new node from cpu_up(), the node should be empty */
 	if (!node_start_pfn && !node_end_pfn)
 		return 0;

 	/* Get the start and end of the zone */
-	zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
-	zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+	*zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+	*zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
 	adjust_zone_range_for_zone_movable(nid, zone_type,
 				node_start_pfn, node_end_pfn,
-				&zone_start_pfn, &zone_end_pfn);
+				zone_start_pfn, zone_end_pfn);

 	/* Check that this node has pages within the zone's required range */
-	if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+	if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
 		return 0;

 	/* Move the zone boundaries inside the node if necessary */
-	zone_end_pfn = min(zone_end_pfn, node_end_pfn);
-	zone_start_pfn = max(zone_start_pfn, node_start_pfn);
+	*zone_end_pfn = min(*zone_end_pfn, node_end_pfn);
+	*zone_start_pfn = max(*zone_start_pfn, node_start_pfn);

 	/* Return the spanned pages */
-	return zone_end_pfn - zone_start_pfn;
+	return *zone_end_pfn - *zone_start_pfn;
 }

 /*
@@ -5017,6 +5017,8 @@ static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long node_start_pfn,
 					unsigned long node_end_pfn,
+					unsigned long *zone_start_pfn,
+					unsigned long *zone_end_pfn,
 					unsigned long *zones_size)
 {
 	return zones_size[zone_type];
@@ -5047,15 +5049,22 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,

 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		struct zone *zone = pgdat->node_zones + i;
+		unsigned long zone_start_pfn, zone_end_pfn;
 		unsigned long size, real_size;

 		size = zone_spanned_pages_in_node(pgdat->node_id, i,
 						  node_start_pfn,
 						  node_end_pfn,
+						  &zone_start_pfn,
+						  &zone_end_pfn,
 						  zones_size);
 		real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
 						  node_start_pfn, node_end_pfn,
 						  zholes_size);
+		if (size)
+			zone->zone_start_pfn = zone_start_pfn;
+		else
+			zone->zone_start_pfn = 0;
 		zone->spanned_pages = size;
 		zone->present_pages = real_size;

@@ -5176,7 +5185,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 {
 	enum zone_type j;
 	int nid = pgdat->node_id;
-	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;

 	pgdat_resize_init(pgdat);
@@ -5192,6 +5200,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, freesize, memmap_pages;
+		unsigned long zone_start_pfn = zone->zone_start_pfn;

 		size = zone->spanned_pages;
 		realsize = freesize = zone->present_pages
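
The core of this change, clamping a zone's range to its node and handing the clamped bounds back to the caller, can be tried outside the kernel with the sketch below. It is only an approximation under stated assumptions: `spanned_pages()` is a hypothetical stand-alone function, the caller passes in the zone's initial bounds instead of reading `arch_zone_lowest_possible_pfn[]`, and the `adjust_zone_range_for_zone_movable()` step is left out.

```c
#include <assert.h>

/*
 * Clamp the zone range [*zone_start_pfn, *zone_end_pfn) to the node range
 * and return the number of spanned pages. The clamped bounds are written
 * back through the pointers, like the out-parameters the patch adds.
 * Hypothetical user-space helper, not the kernel function itself.
 */
unsigned long spanned_pages(unsigned long node_start_pfn,
			    unsigned long node_end_pfn,
			    unsigned long *zone_start_pfn,
			    unsigned long *zone_end_pfn)
{
	assert(zone_start_pfn && zone_end_pfn);

	/* An empty node spans nothing. */
	if (!node_start_pfn && !node_end_pfn)
		return 0;

	/* The zone's range does not reach into this node. */
	if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
		return 0;

	/* Move the zone boundaries inside the node if necessary. */
	if (*zone_end_pfn > node_end_pfn)
		*zone_end_pfn = node_end_pfn;
	if (*zone_start_pfn < node_start_pfn)
		*zone_start_pfn = node_start_pfn;

	return *zone_end_pfn - *zone_start_pfn;
}
```

Because the clamped start is returned to the caller, the caller can record it as the zone's start whenever the spanned size is non-zero, which is what the `calculate_node_totalpages()` hunk does with `zone->zone_start_pfn`.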
[PATCH v3 0/2] mm: Introduce kernelcore=mirror option
Xeon E7 v3 based systems support Address Range Mirroring,
and a UEFI BIOS compliant with UEFI spec 2.5 can report which
ranges are mirrored (reliable) via the EFI memory map.
The Linux kernel now uses this information and allocates
boot-time memory from the reliable region.

My requirement is:
  - allocate kernel memory from mirrored region
  - allocate user memory from non-mirrored region

In order to meet my requirement, ZONE_MOVABLE is useful.
By arranging the non-mirrored range into ZONE_MOVABLE,
mirrored memory is used for kernel allocations.

My idea is to extend the existing "kernelcore" option and
introduce a kernelcore=mirror option. By specifying
"mirror" instead of specifying the amount of memory,
the non-mirrored region will be arranged into ZONE_MOVABLE.

Earlier discussions are at:
 https://lkml.org/lkml/2015/10/9/24
 https://lkml.org/lkml/2015/10/15/9
 https://lkml.org/lkml/2015/11/27/18

For example, suppose a 2-node system with the following memory
ranges:
  node 0 [mem 0x1000-0x00109fff]
  node 1 [mem 0x0010a000-0x00209fff]

and the following ranges are marked as reliable (mirrored):
  [0x-0x0001]
  [0x0001-0x00018000]
  [0x0008-0x00088000]
  [0x0010a000-0x00112000]
  [0x0017a000-0x00182000]

If you specify kernelcore=mirror, ZONE_NORMAL and ZONE_MOVABLE
are arranged like below:

 - node 0:
     ZONE_NORMAL : [0x0001-0x0010a000]
     ZONE_MOVABLE: [0x00018000-0x0010a000]
 - node 1:
     ZONE_NORMAL : [0x0010a000-0x0020a000]
     ZONE_MOVABLE: [0x00112000-0x0020a000]

In the overlapped range, pages that belong to ZONE_MOVABLE are
treated as absent pages in ZONE_NORMAL, and vice versa.

v1 -> v2:
  Refine so that the above example case also can be
  handled properly.
v2 -> v3:
  Change the option name from kernelcore=reliable
  into kernelcore=mirror and fix some documentation
  according to Andrew Morton's point.

Taku Izumi (2):
  mm: Calculate zone_start_pfn at zone_spanned_pages_in_node()
  mm: Introduce kernelcore=mirror option

 Documentation/kernel-parameters.txt |  11 ++-
 mm/page_alloc.c                     | 140 +++-
 2 files changed, 133 insertions(+), 18 deletions(-)

-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
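To make the arrangement described in the cover letter above easier to follow, here is a minimal user-space sketch. It is not the patch's code. It assumes the mirrored ranges are given as a sorted, non-overlapping array of half-open PFN ranges, and it ignores the lower zones (DMA/DMA32). The names pfn_range, mirrored_pages_in() and movable_start_pfn() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical half-open PFN range [start, end); not a kernel type. */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Count the pages of [start, end) that are covered by mirrored ranges.
 * Assumes the mirrored ranges do not overlap each other.
 */
static unsigned long mirrored_pages_in(const struct pfn_range *mirrored,
				       size_t n, unsigned long start,
				       unsigned long end)
{
	unsigned long pages = 0;
	size_t i;

	assert(start <= end);
	for (i = 0; i < n; i++) {
		unsigned long s = mirrored[i].start > start ?
				  mirrored[i].start : start;
		unsigned long e = mirrored[i].end < end ?
				  mirrored[i].end : end;

		if (s < e)
			pages += e - s;
	}
	return pages;
}

/*
 * First PFN of the node that is not mirrored, i.e. where ZONE_MOVABLE
 * would start in this simplified model. Returns node->end if the whole
 * node is mirrored. Assumes the mirrored ranges are sorted by start.
 */
static unsigned long movable_start_pfn(const struct pfn_range *mirrored,
				       size_t n, const struct pfn_range *node)
{
	unsigned long pfn = node->start;
	size_t i;

	assert(node->start <= node->end);
	for (i = 0; i < n; i++) {
		if (mirrored[i].end <= pfn)
			continue;	/* range lies entirely below pfn */
		if (mirrored[i].start > pfn)
			break;		/* gap found: pfn is not mirrored */
		pfn = mirrored[i].end;	/* pfn is mirrored, skip the range */
	}
	return pfn < node->end ? pfn : node->end;
}
```

In this model ZONE_NORMAL spans the whole node but only the mirrored pages are present in it, while ZONE_MOVABLE spans from movable_start_pfn() to the end of the node and the mirrored pages inside that span are counted as absent. For a node [0, 100) with mirrored ranges [0, 20), [20, 30) and [60, 70), ZONE_MOVABLE would start at PFN 30 and have 60 present pages out of 70 spanned.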
RE: [PATCH v2 0/2] mm: Introduce kernelcore=reliable option
Dear Tony,

> > Which do you think is better?
> >   - change into kernelcore="mirrored"
> >   - keep kernelcore="reliable" and minimal printk fix
>
> UEFI came up with the "reliable" wording (as a more generic term ...
> as Andrew said it could cover differences in ECC modes, or some
> alternate memory technology that has lower error rates).
>
> But I personally like "mirror" more ... it matches current
> implementation. Of course I'll look silly if some future system
> does something other than mirror.

Okay, I'll change the option name into kernelcore=mirror.

Sincerely,
Taku Izumi
RE: [PATCH v2 0/2] mm: Introduce kernelcore=reliable option
Dear Tony,

Thanks for testing!

Dear Andrew,

> > Xeon E7 v3 based systems supports Address Range Mirroring
> > and UEFI BIOS complied with UEFI spec 2.5 can notify which
> > ranges are reliable (mirrored) via EFI memory map.
> > Now Linux kernel utilize its information and allocates
> > boot time memory from reliable region.
> >
> > My requirement is:
> >   - allocate kernel memory from reliable region
> >   - allocate user memory from non-reliable region
> >
> > In order to meet my requirement, ZONE_MOVABLE is useful.
> > By arranging non-reliable range into ZONE_MOVABLE,
> > reliable memory is only used for kernel allocations.
> >
> > My idea is to extend existing "kernelcore" option and
> > introduces kernelcore=reliable option. By specifying
> > "reliable" instead of specifying the amount of memory,
> > non-reliable region will be arranged into ZONE_MOVABLE.
>
> It is unfortunate that the kernel presently refers to this memory as
> "mirrored", but this patchset introduces the new term "reliable". I
> think it would be better if we use "mirrored" throughout.
>
> Of course, mirroring isn't the only way to get reliable memory.

YES. "mirroring" is not the only way.
So, in my opinion, we should change "mirrored" into "reliable" in order
to match the terms of the UEFI 2.5 spec.

> Perhaps if a part of the system memory has ECC correction then this
> also can be accessed using "reliable", in which case your proposed
> naming makes sense. reliable == mirrored || ecc?

"reliable" is better.

But I'm willing to change "reliable" into "mirrored".
Otherwise, I will keep "kernelcore=reliable" and add the following
minimal fix as a separate patch:

diff a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -134,7 +134,7 @@ void __init efi_find_mirror(void)
 		}
 	}
 	if (mirror_size)
-		pr_info("Memory: %lldM/%lldM mirrored memory\n",
+		pr_info("Memory: %lldM/%lldM reliable memory\n",
 			mirror_size>>20, total_size>>20);
 }

Which do you think is better?
  - change into kernelcore="mirrored"
  - keep kernelcore="reliable" and minimal printk fix

> Secondly, does this patchset mean that kernelcore=reliable and
> kernelcore=100M are exclusive? Or can the user specify
> "kernelcore=reliable,kernelcore=100M" to use 100M of reliable memory
> for kernelcore?

No, they cannot be combined; these are exclusive.

> This is unclear from the documentation and I suggest that this be
> spelled out.

Thanks. I'll update the documentation.

Sincerely,
Taku Izumi
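As an illustration of the "exclusive" answer above, the following is a minimal user-space sketch of a parser for the option value. It is not the kernel's command-line code. It assumes the keyword is spelled "mirror" (the name adopted in v3; v2 used "reliable"), and the names kernelcore_opt and parse_kernelcore() are hypothetical. Size suffixes are handled in the spirit of nn[KMG].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for this sketch; not a kernel structure. */
struct kernelcore_opt {
	bool mirror;			/* true when "mirror" was given */
	unsigned long long bytes;	/* size request, 0 when mirror is set */
};

/*
 * Parse the value of kernelcore=. Accepts either the keyword "mirror"
 * or a size nn[KMG], never both. Returns 0 on success, -1 on bad input.
 */
static int parse_kernelcore(const char *arg, struct kernelcore_opt *out)
{
	unsigned long long val;
	char *end;

	assert(arg && out);
	out->mirror = false;
	out->bytes = 0;

	if (strcmp(arg, "mirror") == 0) {
		out->mirror = true;
		return 0;
	}

	val = strtoull(arg, &end, 0);
	if (end == arg)
		return -1;		/* no digits at all */

	switch (*end) {
	case 'G':
	case 'g':
		val <<= 10;
		/* fall through */
	case 'M':
	case 'm':
		val <<= 10;
		/* fall through */
	case 'K':
	case 'k':
		val <<= 10;
		end++;
		break;
	default:
		break;
	}

	if (*end != '\0')
		return -1;		/* trailing text, e.g. "100M,mirror" */

	out->bytes = val;
	return 0;
}
```

With this shape a caller can only ever see one of the two modes: "mirror" sets the flag and leaves the size at 0, "100M" sets the size and leaves the flag clear, and a combined string such as "mirror,100M" is rejected.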
RE: [PATCH v2 2/2] mm: Introduce kernelcore=reliable option
Dear Xishi,

Thanks for reviewing.

> -----Original Message-----
> From: Xishi Qiu [mailto:qiuxi...@huawei.com]
> Sent: Wednesday, December 09, 2015 11:26 AM
> To: Izumi, Taku/泉 拓
> Cc: linux-kernel@vger.kernel.org; linux...@kvack.org; tony.l...@intel.com;
> Kamezawa, Hiroyuki/亀澤 寛之; m...@csn.ul.ie;
> a...@linux-foundation.org; dave.han...@intel.com; m...@codeblueprint.co.uk
> Subject: Re: [PATCH v2 2/2] mm: Introduce kernelcore=reliable option
>
> On 2015/11/27 23:04, Taku Izumi wrote:
>
> > This patch extends existing "kernelcore" option and
> > introduces kernelcore=reliable option. By specifying
> > "reliable" instead of specifying the amount of memory,
> > non-reliable region will be arranged into ZONE_MOVABLE.
> >
> > v1 -> v2:
> >  - Refine so that the following case also can be
> >    handled properly:
> >
> >  Node X:  |MM--MM|
> >    (legend) M: mirrored  -: not mirrored
> >
> >  In this case, ZONE_NORMAL and ZONE_MOVABLE are
> >  arranged like below:
> >
> >  Node X:  |--|
> >           |ooxxxxxxoo| ZONE_NORMAL
> >           |ooxx| ZONE_MOVABLE
> >    (legend) o: present  x: absent
> >
> > Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
> > ---
> >  Documentation/kernel-parameters.txt |   9 ++-
> >  mm/page_alloc.c                     | 110 ++--
> >  2 files changed, 112 insertions(+), 7 deletions(-)
> >
> > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > index f8aae63..ed44c2c8 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -1695,7 +1695,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >
> >  	keepinitrd	[HW,ARM]
> >
> > -	kernelcore=nn[KMG]	[KNL,X86,IA-64,PPC] This parameter
> > +	kernelcore=	Format: nn[KMG] | "reliable"
> > +			[KNL,X86,IA-64,PPC] This parameter
> >  			specifies the amount of memory usable by the kernel
> >  			for non-movable allocations. The requested amount is
> >  			spread evenly throughout all nodes in the system. The
> > @@ -1711,6 +1712,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >  			use the HighMem zone if it exists, and the Normal
> >  			zone if it does not.
> >
> > +			Instead of specifying the amount of memory (nn[KMS]),
> > +			you can specify "reliable" option. In case "reliable"
> > +			option is specified, reliable memory is used for
> > +			non-movable allocations and remaining memory is used
> > +			for Movable pages.
> > +
> >  	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
> >  			Format: <Controller#>[,poll interval]
> >  			The controller # is the number of the ehci usb debug
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index acb0b4e..006a3d8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -251,6 +251,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
> >  static unsigned long __initdata required_kernelcore;
> >  static unsigned long __initdata required_movablecore;
> >  static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
> > +static bool reliable_kernelcore;
> >
> >  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
> >  int movable_zone;
> > @@ -4472,6 +4473,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >  	unsigned long pfn;
> >  	struct zone *z;
> >  	unsigned long nr_initialised = 0;
> > +	struct memblock_region *r = NULL, *tmp;
> >
> >  	if (highest_memmap_pfn < end_pfn - 1)
> >  		highest_memmap_pfn = end_pfn - 1;
> > @@ -4491,6 +4493,38 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> >  			if (!update_defer_init(pgdat, pfn, end_pfn,
> >  						&nr_initialised))
> >  				break;
> > +
> > +			/*
> > +			 * if not reliable_kernelcore and ZONE_MOVABLE exists,
>
[PATCH v3 2/2] mm: Introduce kernelcore=mirror option
This patch extends the existing "kernelcore" option and
introduces the kernelcore=mirror option. By specifying
"mirror" instead of specifying the amount of memory,
the non-mirrored (non-reliable) region will be arranged
into ZONE_MOVABLE.

v1 -> v2:
 - Refine so that the following case also can be
   handled properly:

 Node X:  |MM--MM|
   (legend) M: mirrored  -: not mirrored

 In this case, ZONE_NORMAL and ZONE_MOVABLE are
 arranged like below:

 Node X:  |MM--MM|
          |ooxxoo| ZONE_NORMAL
          |ooxx| ZONE_MOVABLE
   (legend) o: present  x: absent

v2 -> v3:
 - change the option name from kernelcore=reliable
   into kernelcore=mirror
 - documentation fix so that users can understand
   nn[KMS] and mirror are exclusive

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |  11 +++-
 mm/page_alloc.c                     | 110 ++--
 2 files changed, 114 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f8aae63..b0ffc76 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1695,7 +1695,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.

 	keepinitrd	[HW,ARM]

-	kernelcore=nn[KMG]	[KNL,X86,IA-64,PPC] This parameter
+	kernelcore=	Format: nn[KMG] | "mirror"
+			[KNL,X86,IA-64,PPC] This parameter
 			specifies the amount of memory usable by the kernel
 			for non-movable allocations. The requested amount is
 			spread evenly throughout all nodes in the system. The
@@ -1711,6 +1712,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			use the HighMem zone if it exists, and the Normal
 			zone if it does not.

+			Instead of specifying the amount of memory (nn[KMS]),
+			you can specify "mirror" option. In case "mirror"
+			option is specified, mirrored (reliable) memory is used
+			for non-movable allocations and remaining memory is used
+			for Movable pages. nn[KMS] and "mirror" are exclusive,
+			so you can NOT specify nn[KMG] and "mirror" at the same
+			time.
+
 	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
 			Format: <Controller#>[,poll interval]
 			The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index acb0b4e..4157476 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -251,6 +251,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool mirrored_kernelcore;

 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -4472,6 +4473,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	unsigned long pfn;
 	struct zone *z;
 	unsigned long nr_initialised = 0;
+	struct memblock_region *r = NULL, *tmp;

 	if (highest_memmap_pfn < end_pfn - 1)
 		highest_memmap_pfn = end_pfn - 1;
@@ -4491,6 +4493,38 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			if (!update_defer_init(pgdat, pfn, end_pfn,
 						&nr_initialised))
 				break;
+
+			/*
+			 * if not mirrored_kernelcore and ZONE_MOVABLE exists,
+			 * range from zone_movable_pfn[nid] to end of each node
+			 * should be ZONE_MOVABLE not ZONE_NORMAL. skip it.
+			 */
+			if (!mirrored_kernelcore && zone_movable_pfn[nid])
+				if (zone == ZONE_NORMAL &&
+				    pfn >= zone_movable_pfn[nid])
+					continue;
+
+			/*
+			 * check given memblock attribute by firmware which
+			 * can affect kernel memory layout.
+			 * if zone==ZONE_MOVABLE but memory is mirrored,
+			 * it's an overlapped memmap init. skip it.
+			 */
+			if (mirrored_kernelcore && zone == ZONE_MOVABLE) {
+				if (!r ||
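The two skip conditions added to memmap_init_zone() in the hunk above can be read in isolation as a small predicate. The following is only a user-space sketch under stated assumptions: the types zone_kind and mem_region and the functions find_region() and skip_pfn() are hypothetical stand-ins, the region array is assumed to be sorted by address, and the region lookup is done on every call instead of being cached in a local pointer the way the patch does with r and tmp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for this sketch; not kernel definitions. */
enum zone_kind { ZONE_KIND_NORMAL, ZONE_KIND_MOVABLE };

struct mem_region {
	unsigned long base_pfn;
	unsigned long end_pfn;	/* exclusive */
	bool mirrored;
};

/* Return the first region whose end lies above pfn, or NULL. */
static const struct mem_region *find_region(const struct mem_region *regions,
					     size_t n, unsigned long pfn)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (pfn < regions[i].end_pfn)
			return &regions[i];
	return NULL;
}

/*
 * Decide whether pfn should be skipped while initialising the memmap
 * of the given zone, following the two comments in the patch hunk.
 */
static bool skip_pfn(bool mirrored_kernelcore, unsigned long zone_movable_pfn,
		     enum zone_kind zone, unsigned long pfn,
		     const struct mem_region *regions, size_t n)
{
	const struct mem_region *r;

	assert(regions || n == 0);

	/*
	 * Size-based kernelcore: everything from zone_movable_pfn to the
	 * end of the node belongs to ZONE_MOVABLE, not ZONE_NORMAL.
	 */
	if (!mirrored_kernelcore && zone_movable_pfn &&
	    zone == ZONE_KIND_NORMAL && pfn >= zone_movable_pfn)
		return true;

	/*
	 * Mirror-based kernelcore: a mirrored page inside the span of
	 * ZONE_MOVABLE is an overlap and stays with ZONE_NORMAL.
	 */
	if (mirrored_kernelcore && zone == ZONE_KIND_MOVABLE) {
		r = find_region(regions, n, pfn);
		if (r && pfn >= r->base_pfn && r->mirrored)
			return true;
	}

	return false;
}
```

The per-call lookup keeps the sketch short. Caching the last matching region, as the patch does, avoids rescanning the region list for every PFN in a long loop.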
[PATCH v3 1/2] mm: Calculate zone_start_pfn at zone_spanned_pages_in_node()
Currently each zone's zone_start_pfn is calculated at
free_area_init_core(). However, the zone's range is already fixed
at the time zone_spanned_pages_in_node() is invoked.

This patch changes it so that each zone->zone_start_pfn is
calculated at zone_spanned_pages_in_node().

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 mm/page_alloc.c | 30 +++---
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17a3c66..acb0b4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4928,31 +4928,31 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long node_start_pfn,
 					unsigned long node_end_pfn,
+					unsigned long *zone_start_pfn,
+					unsigned long *zone_end_pfn,
 					unsigned long *ignored)
 {
-	unsigned long zone_start_pfn, zone_end_pfn;
-
 	/* When hotadd a new node from cpu_up(), the node should be empty */
 	if (!node_start_pfn && !node_end_pfn)
 		return 0;

 	/* Get the start and end of the zone */
-	zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
-	zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+	*zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+	*zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
 	adjust_zone_range_for_zone_movable(nid, zone_type,
 				node_start_pfn, node_end_pfn,
-				&zone_start_pfn, &zone_end_pfn);
+				zone_start_pfn, zone_end_pfn);

 	/* Check that this node has pages within the zone's required range */
-	if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+	if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
 		return 0;

 	/* Move the zone boundaries inside the node if necessary */
-	zone_end_pfn = min(zone_end_pfn, node_end_pfn);
-	zone_start_pfn = max(zone_start_pfn, node_start_pfn);
+	*zone_end_pfn = min(*zone_end_pfn, node_end_pfn);
+	*zone_start_pfn = max(*zone_start_pfn, node_start_pfn);

 	/* Return the spanned pages */
-	return zone_end_pfn - zone_start_pfn;
+	return *zone_end_pfn - *zone_start_pfn;
 }

 /*
@@ -5017,6 +5017,8 @@ static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long node_start_pfn,
 					unsigned long node_end_pfn,
+					unsigned long *zone_start_pfn,
+					unsigned long *zone_end_pfn,
 					unsigned long *zones_size)
 {
 	return zones_size[zone_type];
@@ -5047,15 +5049,22 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,

 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		struct zone *zone = pgdat->node_zones + i;
+		unsigned long zone_start_pfn, zone_end_pfn;
 		unsigned long size, real_size;

 		size = zone_spanned_pages_in_node(pgdat->node_id, i,
 						  node_start_pfn,
 						  node_end_pfn,
+						  &zone_start_pfn,
+						  &zone_end_pfn,
 						  zones_size);
 		real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
 						  node_start_pfn, node_end_pfn,
 						  zholes_size);
+		if (size)
+			zone->zone_start_pfn = zone_start_pfn;
+		else
+			zone->zone_start_pfn = 0;
 		zone->spanned_pages = size;
 		zone->present_pages = real_size;

@@ -5176,7 +5185,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 {
 	enum zone_type j;
 	int nid = pgdat->node_id;
-	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;

 	pgdat_resize_init(pgdat);
@@ -5192,6 +5200,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, freesize, memmap_pages;
+		unsigned long zone_start_pfn = zone->zone_start_pfn;

 		size = zone->spanned_pages;
 		reals
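The core of this change is that the span calculation now hands the clamped zone boundaries back to the caller through out-parameters, and the caller records the start only when the zone is non-empty. Below is a minimal user-space sketch of that pattern. It is not the kernel function: the names zone_span(), zone_info and zone_info_init() are hypothetical, the per-zone limits are passed in directly instead of being read from the arch_zone_*_possible_pfn arrays, and the ZONE_MOVABLE adjustment step is left out.

```c
#include <assert.h>

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Return the number of pages the zone [zone_lo, zone_hi) spans inside
 * the node [node_start, node_end) and report the clamped boundaries
 * through zone_start and zone_end. When an empty node is passed, the
 * out-parameters are left untouched, as in the patch.
 */
static unsigned long zone_span(unsigned long zone_lo, unsigned long zone_hi,
			       unsigned long node_start, unsigned long node_end,
			       unsigned long *zone_start,
			       unsigned long *zone_end)
{
	assert(zone_start && zone_end);

	if (!node_start && !node_end)
		return 0;

	*zone_start = zone_lo;
	*zone_end = zone_hi;

	/* The node has no pages within the zone's range. */
	if (*zone_end < node_start || *zone_start > node_end)
		return 0;

	/* Move the zone boundaries inside the node. */
	*zone_end = min_ul(*zone_end, node_end);
	*zone_start = max_ul(*zone_start, node_start);

	return *zone_end - *zone_start;
}

/* Hypothetical per-zone record for this sketch. */
struct zone_info {
	unsigned long start_pfn;
	unsigned long spanned_pages;
};

/* Caller side: record the start only when the zone is not empty. */
static void zone_info_init(struct zone_info *z,
			   unsigned long zone_lo, unsigned long zone_hi,
			   unsigned long node_start, unsigned long node_end)
{
	unsigned long start = 0, end = 0;
	unsigned long size;

	assert(z);
	size = zone_span(zone_lo, zone_hi, node_start, node_end,
			 &start, &end);
	z->start_pfn = size ? start : 0;
	z->spanned_pages = size;
}
```

The size check on the caller side matters because the out-parameters are not meaningful when the function returns 0, which matches the "if (size) ... else zone->zone_start_pfn = 0" logic the patch adds to calculate_node_totalpages().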
[PATCH v2 0/2] mm: Introduce kernelcore=reliable option
Xeon E7 v3 based systems support Address Range Mirroring,
and a UEFI BIOS compliant with UEFI spec 2.5 can report which
ranges are reliable (mirrored) via the EFI memory map.
The Linux kernel now uses this information and allocates
boot-time memory from the reliable region.

My requirement is:
  - allocate kernel memory from reliable region
  - allocate user memory from non-reliable region

In order to meet my requirement, ZONE_MOVABLE is useful.
By arranging the non-reliable range into ZONE_MOVABLE,
reliable memory is only used for kernel allocations.

My idea is to extend the existing "kernelcore" option and
introduce a kernelcore=reliable option. By specifying
"reliable" instead of specifying the amount of memory,
the non-reliable region will be arranged into ZONE_MOVABLE.

Earlier discussions are at:
 https://lkml.org/lkml/2015/10/9/24
 https://lkml.org/lkml/2015/10/15/9

For example, suppose a 2-node system with the following memory
ranges:
  node 0 [mem 0x1000-0x00109fff]
  node 1 [mem 0x0010a000-0x00209fff]

and the following ranges are marked as reliable:
  [0x-0x0001]
  [0x0001-0x00018000]
  [0x0008-0x00088000]
  [0x0010a000-0x00112000]
  [0x0017a000-0x00182000]

If you specify kernelcore=reliable, ZONE_NORMAL and ZONE_MOVABLE
are arranged like below:

 - node 0:
     ZONE_NORMAL : [0x0001-0x0010a000]
     ZONE_MOVABLE: [0x00018000-0x0010a000]
 - node 1:
     ZONE_NORMAL : [0x0010a000-0x0020a000]
     ZONE_MOVABLE: [0x00112000-0x0020a000]

In the overlapped range, pages that belong to ZONE_MOVABLE are
treated as absent pages in ZONE_NORMAL, and vice versa.

v1 -> v2:
  Refine so that the above example case also can be
  handled properly.

Taku Izumi (2):
  mm: Calculate zone_start_pfn at zone_spanned_pages_in_node()
  mm: Introduce kernelcore=reliable option

 Documentation/kernel-parameters.txt |   9 ++-
 mm/page_alloc.c                     | 140 +++-
 2 files changed, 131 insertions(+), 18 deletions(-)

-- 
1.8.3.1
[PATCH v2 1/2] mm: Calculate zone_start_pfn at zone_spanned_pages_in_node()
Currently each zone's zone_start_pfn is calculated at
free_area_init_core(). However, the zone's range is already fixed
at the time zone_spanned_pages_in_node() is invoked.

This patch changes it so that each zone->zone_start_pfn is
calculated at zone_spanned_pages_in_node().

Signed-off-by: Taku Izumi
---
 mm/page_alloc.c | 30 +++---
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17a3c66..acb0b4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4928,31 +4928,31 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long node_start_pfn,
 					unsigned long node_end_pfn,
+					unsigned long *zone_start_pfn,
+					unsigned long *zone_end_pfn,
 					unsigned long *ignored)
 {
-	unsigned long zone_start_pfn, zone_end_pfn;
-
 	/* When hotadd a new node from cpu_up(), the node should be empty */
 	if (!node_start_pfn && !node_end_pfn)
 		return 0;

 	/* Get the start and end of the zone */
-	zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
-	zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+	*zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+	*zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
 	adjust_zone_range_for_zone_movable(nid, zone_type,
 				node_start_pfn, node_end_pfn,
-				&zone_start_pfn, &zone_end_pfn);
+				zone_start_pfn, zone_end_pfn);

 	/* Check that this node has pages within the zone's required range */
-	if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+	if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
 		return 0;

 	/* Move the zone boundaries inside the node if necessary */
-	zone_end_pfn = min(zone_end_pfn, node_end_pfn);
-	zone_start_pfn = max(zone_start_pfn, node_start_pfn);
+	*zone_end_pfn = min(*zone_end_pfn, node_end_pfn);
+	*zone_start_pfn = max(*zone_start_pfn, node_start_pfn);

 	/* Return the spanned pages */
-	return zone_end_pfn - zone_start_pfn;
+	return *zone_end_pfn - *zone_start_pfn;
 }

 /*
@@ -5017,6 +5017,8 @@ static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long node_start_pfn,
 					unsigned long node_end_pfn,
+					unsigned long *zone_start_pfn,
+					unsigned long *zone_end_pfn,
 					unsigned long *zones_size)
 {
 	return zones_size[zone_type];
@@ -5047,15 +5049,22 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,

 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		struct zone *zone = pgdat->node_zones + i;
+		unsigned long zone_start_pfn, zone_end_pfn;
 		unsigned long size, real_size;

 		size = zone_spanned_pages_in_node(pgdat->node_id, i,
 						  node_start_pfn,
 						  node_end_pfn,
+						  &zone_start_pfn,
+						  &zone_end_pfn,
 						  zones_size);
 		real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
 						  node_start_pfn, node_end_pfn,
 						  zholes_size);
+		if (size)
+			zone->zone_start_pfn = zone_start_pfn;
+		else
+			zone->zone_start_pfn = 0;
 		zone->spanned_pages = size;
 		zone->present_pages = real_size;

@@ -5176,7 +5185,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 {
 	enum zone_type j;
 	int nid = pgdat->node_id;
-	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;

 	pgdat_resize_init(pgdat);
@@ -5192,6 +5200,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, freesize, memmap_pages;
+		unsigned long zone_start_pfn = zone->zone_start_pfn;

 		size = zone->spanned_pages;
 		realsize = freesize = zone->present_pages
[PATCH v2 2/2] mm: Introduce kernelcore=reliable option
This patch extends the existing "kernelcore" option and
introduces the kernelcore=reliable option. By specifying
"reliable" instead of specifying the amount of memory,
the non-reliable region will be arranged into ZONE_MOVABLE.

v1 -> v2:
 - Refine so that the following case also can be
   handled properly:

 Node X:  |MM--MM|
   (legend) M: mirrored  -: not mirrored

 In this case, ZONE_NORMAL and ZONE_MOVABLE are
 arranged like below:

 Node X:  |--|
          |ooxxoo| ZONE_NORMAL
          |ooxx| ZONE_MOVABLE
   (legend) o: present  x: absent

Signed-off-by: Taku Izumi
---
 Documentation/kernel-parameters.txt |   9 ++-
 mm/page_alloc.c                     | 110 ++--
 2 files changed, 112 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f8aae63..ed44c2c8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1695,7 +1695,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.

 	keepinitrd	[HW,ARM]

-	kernelcore=nn[KMG]	[KNL,X86,IA-64,PPC] This parameter
+	kernelcore=	Format: nn[KMG] | "reliable"
+			[KNL,X86,IA-64,PPC] This parameter
 			specifies the amount of memory usable by the kernel
 			for non-movable allocations. The requested amount is
 			spread evenly throughout all nodes in the system. The
@@ -1711,6 +1712,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			use the HighMem zone if it exists, and the Normal
 			zone if it does not.

+			Instead of specifying the amount of memory (nn[KMS]),
+			you can specify "reliable" option. In case "reliable"
+			option is specified, reliable memory is used for
+			non-movable allocations and remaining memory is used
+			for Movable pages.
+
 	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
 			Format: <Controller#>[,poll interval]
 			The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index acb0b4e..006a3d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -251,6 +251,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool reliable_kernelcore;

 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -4472,6 +4473,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	unsigned long pfn;
 	struct zone *z;
 	unsigned long nr_initialised = 0;
+	struct memblock_region *r = NULL, *tmp;

 	if (highest_memmap_pfn < end_pfn - 1)
 		highest_memmap_pfn = end_pfn - 1;
@@ -4491,6 +4493,38 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			if (!update_defer_init(pgdat, pfn, end_pfn,
 						&nr_initialised))
 				break;
+
+			/*
+			 * if not reliable_kernelcore and ZONE_MOVABLE exists,
+			 * range from zone_movable_pfn[nid] to end of each node
+			 * should be ZONE_MOVABLE not ZONE_NORMAL. skip it.
+			 */
+			if (!reliable_kernelcore && zone_movable_pfn[nid])
+				if (zone == ZONE_NORMAL &&
+				    pfn >= zone_movable_pfn[nid])
+					continue;
+
+			/*
+			 * check given memblock attribute by firmware which
+			 * can affect kernel memory layout.
+			 * if zone==ZONE_MOVABLE but memory is mirrored,
+			 * it's an overlapped memmap init. skip it.
+			 */
+			if (reliable_kernelcore && zone == ZONE_MOVABLE) {
+				if (!r ||
+				    pfn >= memblock_region_memory_end_pfn(r)) {
+					for_each_memblock(memory, tmp)
+						if (pfn < memblock_region_memory_end_pfn(tmp))
+							break;
+					r = tmp;
+
[PATCH v2 1/2] mm: Calculate zone_start_pfn at zone_spanned_pages_in_node()
Currently each zone's zone_start_pfn is calculated in free_area_init_core().
However, a zone's range is already fixed by the time
zone_spanned_pages_in_node() is invoked.

This patch changes the code so that each zone->zone_start_pfn is calculated
in zone_spanned_pages_in_node().

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 mm/page_alloc.c | 30 +++---
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17a3c66..acb0b4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4928,31 +4928,31 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long node_start_pfn,
 					unsigned long node_end_pfn,
+					unsigned long *zone_start_pfn,
+					unsigned long *zone_end_pfn,
 					unsigned long *ignored)
 {
-	unsigned long zone_start_pfn, zone_end_pfn;
-
 	/* When hotadd a new node from cpu_up(), the node should be empty */
 	if (!node_start_pfn && !node_end_pfn)
 		return 0;
 
 	/* Get the start and end of the zone */
-	zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
-	zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+	*zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+	*zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
 	adjust_zone_range_for_zone_movable(nid, zone_type,
 				node_start_pfn, node_end_pfn,
-				&zone_start_pfn, &zone_end_pfn);
+				zone_start_pfn, zone_end_pfn);
 
 	/* Check that this node has pages within the zone's required range */
-	if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+	if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
 		return 0;
 
 	/* Move the zone boundaries inside the node if necessary */
-	zone_end_pfn = min(zone_end_pfn, node_end_pfn);
-	zone_start_pfn = max(zone_start_pfn, node_start_pfn);
+	*zone_end_pfn = min(*zone_end_pfn, node_end_pfn);
+	*zone_start_pfn = max(*zone_start_pfn, node_start_pfn);
 
 	/* Return the spanned pages */
-	return zone_end_pfn - zone_start_pfn;
+	return *zone_end_pfn - *zone_start_pfn;
 }
 
 /*
@@ -5017,6 +5017,8 @@ static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long node_start_pfn,
 					unsigned long node_end_pfn,
+					unsigned long *zone_start_pfn,
+					unsigned long *zone_end_pfn,
 					unsigned long *zones_size)
 {
 	return zones_size[zone_type];
@@ -5047,15 +5049,22 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
 
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		struct zone *zone = pgdat->node_zones + i;
+		unsigned long zone_start_pfn, zone_end_pfn;
 		unsigned long size, real_size;
 
 		size = zone_spanned_pages_in_node(pgdat->node_id, i,
 						  node_start_pfn,
 						  node_end_pfn,
+						  &zone_start_pfn,
+						  &zone_end_pfn,
 						  zones_size);
 		real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
 						  node_start_pfn, node_end_pfn,
 						  zholes_size);
+		if (size)
+			zone->zone_start_pfn = zone_start_pfn;
+		else
+			zone->zone_start_pfn = 0;
 		zone->spanned_pages = size;
 		zone->present_pages = real_size;
 
@@ -5176,7 +5185,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 {
 	enum zone_type j;
 	int nid = pgdat->node_id;
-	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;
 
 	pgdat_resize_init(pgdat);
@@ -5192,6 +5200,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, freesize, memmap_pages;
+		unsigned long zone_start_pfn = zone->zone_start_pfn;
 
 		size = zone->spanned_pages;
 		reals
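The core of this patch is that the clamped zone boundaries are handed back to the caller through out-parameters instead of being thrown away. A minimal user-space sketch of that pattern follows; `zone_span()`, `struct zone_model` and `set_zone_span()` are hypothetical simplifications, and the ZONE_MOVABLE adjustment done by the real function is left out.

```c
#include <assert.h>

/* Hypothetical, reduced model of a zone: only the fields the patch touches. */
struct zone_model {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }
static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }

/*
 * Clamp the architectural zone range [zone_lo, zone_hi) to the node range
 * and report the clamped boundaries through the out-parameters. Returns
 * the number of spanned pages, or 0 if the zone does not touch the node,
 * in which case the out-parameters must not be relied on.
 */
static unsigned long zone_span(unsigned long zone_lo, unsigned long zone_hi,
			       unsigned long node_start_pfn,
			       unsigned long node_end_pfn,
			       unsigned long *zone_start_pfn,
			       unsigned long *zone_end_pfn)
{
	assert(zone_start_pfn && zone_end_pfn);

	/* An empty node spans nothing. */
	if (!node_start_pfn && !node_end_pfn)
		return 0;

	*zone_start_pfn = zone_lo;
	*zone_end_pfn = zone_hi;

	/* The zone lies entirely outside the node. */
	if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
		return 0;

	/* Move the zone boundaries inside the node. */
	*zone_end_pfn = min_ul(*zone_end_pfn, node_end_pfn);
	*zone_start_pfn = max_ul(*zone_start_pfn, node_start_pfn);

	return *zone_end_pfn - *zone_start_pfn;
}

/* Caller side: record the start PFN only when the zone is non-empty. */
static void set_zone_span(struct zone_model *z,
			  unsigned long zone_lo, unsigned long zone_hi,
			  unsigned long node_start_pfn,
			  unsigned long node_end_pfn)
{
	unsigned long start = 0, end = 0;
	unsigned long size = zone_span(zone_lo, zone_hi, node_start_pfn,
				       node_end_pfn, &start, &end);

	z->zone_start_pfn = size ? start : 0;
	z->spanned_pages = size;
}
```

Recording the start at the point where the span is computed means a later stage can read it from the zone rather than deriving it by accumulating sizes from the node start.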
[PATCH v2 2/2] mm: Introduce kernelcore=reliable option
This patch extends the existing "kernelcore" option and introduces a
kernelcore=reliable option. By specifying "reliable" instead of the
amount of memory, non-reliable regions will be arranged into ZONE_MOVABLE.

v1 -> v2:
 - Refine so that the following case can also be handled properly:

 Node X: |MM--MM|
   (legend) M: mirrored  -: not mirrored

 In this case, ZONE_NORMAL and ZONE_MOVABLE are arranged as below:

 Node X: |--|
         |ooxxoo| ZONE_NORMAL
         |ooxx| ZONE_MOVABLE
   (legend) o: present  x: absent

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt | 9 ++-
 mm/page_alloc.c | 110 ++--
 2 files changed, 112 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index f8aae63..ed44c2c8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1695,7 +1695,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 
 	keepinitrd	[HW,ARM]
 
-	kernelcore=nn[KMG]	[KNL,X86,IA-64,PPC] This parameter
+	kernelcore=	Format: nn[KMG] | "reliable"
+			[KNL,X86,IA-64,PPC] This parameter
 			specifies the amount of memory usable by the kernel
 			for non-movable allocations.  The requested amount is
 			spread evenly throughout all nodes in the system. The
@@ -1711,6 +1712,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			use the HighMem zone if it exists, and the Normal
 			zone if it does not.
 
+			Instead of specifying the amount of memory (nn[KMG]),
+			you can specify "reliable" option. In case "reliable"
+			option is specified, reliable memory is used for
+			non-movable allocations and remaining memory is used
+			for Movable pages.
+
 	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
 			Format: <Controller#>[,poll interval]
 			The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index acb0b4e..006a3d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -251,6 +251,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool reliable_kernelcore;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -4472,6 +4473,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	unsigned long pfn;
 	struct zone *z;
 	unsigned long nr_initialised = 0;
+	struct memblock_region *r = NULL, *tmp;
 
 	if (highest_memmap_pfn < end_pfn - 1)
 		highest_memmap_pfn = end_pfn - 1;
@@ -4491,6 +4493,38 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			if (!update_defer_init(pgdat, pfn, end_pfn,
 						&nr_initialised))
 				break;
+
+			/*
+			 * if not reliable_kernelcore and ZONE_MOVABLE exists,
+			 * range from zone_movable_pfn[nid] to end of each node
+			 * should be ZONE_MOVABLE not ZONE_NORMAL. skip it.
+			 */
+			if (!reliable_kernelcore && zone_movable_pfn[nid])
+				if (zone == ZONE_NORMAL &&
+				    pfn >= zone_movable_pfn[nid])
+					continue;
+
+			/*
+			 * check given memblock attribute by firmware which
+			 * can affect kernel memory layout.
+			 * if zone==ZONE_MOVABLE but memory is mirrored,
+			 * it's an overlapped memmap init. skip it.
+			 */
+			if (reliable_kernelcore && zone == ZONE_MOVABLE) {
+				if (!r ||
+				    pfn >= memblock_region_memory_end_pfn(r)) {
+					for_each_memblock(memory, tmp)
+						if (pfn < memblock_region_memory_end_pfn(tmp))
+							break;
+					r = tmp;
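The two skip conditions added to memmap_init_zone() can be modelled as a small predicate. The sketch below is an assumption-laden illustration, not the kernel code: `struct mem_region`, `find_region()` and `skip_memmap_init()` are hypothetical, the region lookup is a plain linear scan rather than the cached memblock walk in the patch, and only the behaviour visible in the hunk above is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical zone identifiers for this sketch only. */
enum zone_kind { MODEL_ZONE_NORMAL, MODEL_ZONE_MOVABLE };

/* Hypothetical stand-in for a memblock region. */
struct mem_region {
	unsigned long start_pfn;	/* first page frame of the region */
	unsigned long end_pfn;		/* one past the last page frame */
	bool mirrored;			/* true if firmware marked it reliable */
};

/* Return the region that contains pfn, or NULL if there is none. */
static const struct mem_region *find_region(const struct mem_region *regions,
					    size_t count, unsigned long pfn)
{
	assert(regions || count == 0);

	for (size_t i = 0; i < count; i++)
		if (pfn >= regions[i].start_pfn && pfn < regions[i].end_pfn)
			return &regions[i];
	return NULL;
}

/*
 * Decide whether a page frame should be skipped while initialising the
 * memmap of the given zone.
 */
static bool skip_memmap_init(enum zone_kind zone, unsigned long pfn,
			     bool reliable_kernelcore,
			     unsigned long movable_start_pfn,
			     const struct mem_region *regions, size_t count)
{
	/*
	 * Classic layout: everything from the movable start to the end of
	 * the node belongs to ZONE_MOVABLE, not ZONE_NORMAL.
	 */
	if (!reliable_kernelcore && movable_start_pfn &&
	    zone == MODEL_ZONE_NORMAL && pfn >= movable_start_pfn)
		return true;

	/*
	 * Overlapped layout: a mirrored page inside the ZONE_MOVABLE span
	 * is an absent page of that zone.
	 */
	if (reliable_kernelcore && zone == MODEL_ZONE_MOVABLE) {
		const struct mem_region *r = find_region(regions, count, pfn);

		if (r && r->mirrored)
			return true;
	}

	return false;
}
```

For a node laid out as mirrored, non-mirrored, mirrored, non-mirrored, the second mirrored block is skipped for ZONE_MOVABLE, which corresponds to the "x: absent" entries in the diagram of the commit message.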
RE: [PATCH] fjes: fix inconsistent indenting
Thanks, Colin.

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>

> -----Original Message-----
> From: Colin King [mailto:colin.k...@canonical.com]
> Sent: Thursday, November 12, 2015 12:23 AM
> To: David S. Miller; Izumi, Taku/泉 拓; Markus Elfring; net...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Subject: [PATCH] fjes: fix inconsistent indenting
>
> From: Colin Ian King <colin.k...@canonical.com>
>
> minor change, indenting is one tab out.
>
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>
> ---
>  drivers/net/fjes/fjes_hw.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/fjes/fjes_hw.c b/drivers/net/fjes/fjes_hw.c
> index bb8b530..b103adb 100644
> --- a/drivers/net/fjes/fjes_hw.c
> +++ b/drivers/net/fjes/fjes_hw.c
> @@ -599,7 +599,7 @@ int fjes_hw_unregister_buff_addr(struct fjes_hw *hw, int dest_epid)
>  		FJES_CMD_REQ_RES_CODE_BUSY) &&
>  		(timeout > 0)) {
>  		msleep(200 + hw->my_epid * 20);
> -			timeout -= (200 + hw->my_epid * 20);
> +		timeout -= (200 + hw->my_epid * 20);
>
>  		res_buf->unshare_buffer.length = 0;
>  		res_buf->unshare_buffer.code = 0;
> --
> 2.5.0
[tip:core/efi] efi: Fix warning of int-to-pointer-cast on x86 32-bit builds
Commit-ID:  78b9bc947b18ed16b6c2c573d774e6d54ad9452d
Gitweb:     http://git.kernel.org/tip/78b9bc947b18ed16b6c2c573d774e6d54ad9452d
Author:     Taku Izumi <izumi.t...@jp.fujitsu.com>
AuthorDate: Fri, 23 Oct 2015 11:48:17 +0200
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Wed, 28 Oct 2015 12:28:06 +0100

efi: Fix warning of int-to-pointer-cast on x86 32-bit builds

Commit:

  0f96a99dab36 ("efi: Add "efi_fake_mem" boot option")

introduced the following warning message:

  drivers/firmware/efi/fake_mem.c:186:20: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]

new_memmap_phy was defined as a u64 value and cast to void*,
causing an int-to-pointer-cast warning on x86 32-bit builds.
However, since the void* type is inappropriate for a physical
address, the definition of struct efi_memory_map::phys_map has
been changed to phys_addr_t in the previous patch, and so the cast
can be dropped entirely.

This patch also changes the type of the "new_memmap_phy" variable
from "u64" to "phys_addr_t" to align with the types of
memblock_alloc() and struct efi_memory_map::phys_map.

Reported-by: Ingo Molnar <mi...@kernel.org>
Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
[ Removed void* cast, updated commit log]
Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
Reviewed-by: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: kamezawa.hir...@jp.fujitsu.com
Cc: linux-...@vger.kernel.org
Cc: matt.flem...@intel.com
Link: http://lkml.kernel.org/r/1445593697-1342-2-git-send-email-ard.biesheu...@linaro.org
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 drivers/firmware/efi/fake_mem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index 32bcb14..ed3a854 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -59,7 +59,7 @@ void __init efi_fake_memmap(void)
 	u64 start, end, m_start, m_end, m_attr;
 	int new_nr_map = memmap.nr_map;
 	efi_memory_desc_t *md;
-	u64 new_memmap_phy;
+	phys_addr_t new_memmap_phy;
 	void *new_memmap;
 	void *old, *new;
 	int i;
@@ -183,7 +183,7 @@ void __init efi_fake_memmap(void)
 	/* swap into new EFI memmap */
 	efi_unmap_memmap();
 	memmap.map = new_memmap;
-	memmap.phys_map = (void *)new_memmap_phy;
+	memmap.phys_map = new_memmap_phy;
 	memmap.nr_map = new_nr_map;
 	memmap.map_end = memmap.map + memmap.nr_map * memmap.desc_size;
 	set_bit(EFI_MEMMAP, &efi.flags);
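The point of the final fix is that a physical address is kept in an integer type from allocation to storage, so no integer-to-pointer conversion is needed at all. A minimal user-space sketch of that idea follows; `phys_addr_model_t`, `struct memmap_model` and `swap_in_memmap()` are hypothetical stand-ins for the kernel's phys_addr_t, struct efi_memory_map and the tail of efi_fake_memmap(), and the 64-bit width chosen here is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for phys_addr_t; 64 bits wide in this sketch. */
typedef uint64_t phys_addr_model_t;

/* Hypothetical, reduced model of the EFI memory map bookkeeping. */
struct memmap_model {
	phys_addr_model_t phys_map;	/* physical address: an integer */
	void *map;			/* virtual mapping: a real pointer */
	void *map_end;			/* one past the last descriptor */
	int nr_map;			/* number of descriptors */
	unsigned long desc_size;	/* size of one descriptor in bytes */
};

/*
 * Swap a new memory map into place. The physical address is stored as an
 * integer, so no cast between integer and pointer types is involved.
 */
static void swap_in_memmap(struct memmap_model *m, void *new_map,
			   phys_addr_model_t new_phys, int nr_map,
			   unsigned long desc_size)
{
	assert(m && new_map && nr_map >= 0);

	m->map = new_map;
	m->phys_map = new_phys;
	m->nr_map = nr_map;
	m->desc_size = desc_size;
	m->map_end = (char *)new_map + (size_t)nr_map * desc_size;
}
```

Because the field and the local variable share one integer type, the assignment compiles cleanly whatever the pointer width of the build is.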
RE: [PATCH] mm: Introduce kernelcore=reliable option
Dear Tony,

> -----Original Message-----
> From: Luck, Tony [mailto:tony.l...@intel.com]
> Sent: Friday, October 23, 2015 8:27 AM
> To: Kamezawa, Hiroyuki/亀澤 寛之; Izumi, Taku/泉 拓; linux-kernel@vger.kernel.org;
> linux...@kvack.org
> Cc: qiuxi...@huawei.com; m...@csn.ul.ie; a...@linux-foundation.org; Hansen,
> Dave; m...@codeblueprint.co.uk
> Subject: RE: [PATCH] mm: Introduce kernelcore=reliable option
>
> > I think /proc/zoneinfo can show detailed numbers per zone. Do we need some
> > for meminfo ?
>
> I wrote a little script (attached) to summarize /proc/zoneinfo ... on my
> system it says
>
> $ zoneinfo
> Node     Normal     Movable       DMA     DMA32
>    0       0.00   103020.07      8.94   1554.46
>    1    9284.54    89870.43
>    2    9626.33    94050.09
>    3    9602.82    93650.04
>
> Not sure why I have zero Normal memory free on node0. The sum of all those
> free counts is 410667.72 MB ... which is close enough to the boot time message
> showing the amount of mirror/total memory:
>
> [0.00] efi: Memory: 80979/420096M mirrored memory
>
> but a fair amount of the 80G of mirrored memory seems to have been miscounted
> as Movable instead of Normal. Perhaps this is because I have two blocks of
> mirrored memory on each node and the movable zone code doesn't expect that?

Are you saying that the OS view of each node's memory is something like
the following?

 Node X: |MM--MM|
   (legend) M: mirrored  -: not mirrored

If so, is this a real box's configuration?
Sorry, I haven't got a real Address Range Mirroring capable box yet ...
I thought the mirrored range was concatenated at the beginning of each node.

Sincerely,
Taku Izumi
RE: [PATCH] efi: Fix warning of int-to-pointer-cast on x86 32-bit builds
Dear Ard,

> > commit-0f96a99 introduces the following warning message:
> >
> >   drivers/firmware/efi/fake_mem.c:186:20: warning: cast to pointer
> >   from integer of different size [-Wint-to-pointer-cast]
> >
> > new_memmap_phy was defined as a u64 value and casted to void*.
> > This causes a warning of int-to-pointer-cast on x86 32-bit
> > environment.
> >
> > This patch changes the type of "new_memmap_phy" variable
> > from "u64" into "phys_addr_t" to avoid it.
>
> This assumes sizeof(void*) == sizeof(phys_addr_t), which is not always true,
> e.g., on 32-bit ARM (whose UEFI support is
> in development but not yet merged) with LPAE enabled.
>
> Could we use unsigned long instead?

Okay. I'll update my patch.

Sincerely,
Taku Izumi
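Ard's objection can be made concrete with a small model of a 32-bit system that has 64-bit physical addresses. This is a hedged sketch only: the 32-bit pointer is modelled as a `uint32_t`, the physical address as a `uint64_t`, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Report whether a 64-bit physical address survives a round trip through
 * a 32-bit pointer-sized value, which is all a cast to void * can carry
 * on a 32-bit build.
 */
static bool fits_in_32bit_pointer(uint64_t phys)
{
	uint32_t as_pointer = (uint32_t)phys;	/* high 32 bits are dropped */

	return (uint64_t)as_pointer == phys;
}

/* Convert only when nothing is lost; the assert documents the assumption. */
static uint32_t to_32bit_pointer_checked(uint64_t phys)
{
	assert(fits_in_32bit_pointer(phys));
	return (uint32_t)phys;
}
```

An address below 4 GiB passes the check, while one at or above 4 GiB does not, so storing a physical address in a pointer-sized field silently assumes the former.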
[PATCH] efi: Fix warning of int-to-pointer-cast on x86 32-bit builds
commit-0f96a99 introduces the following warning message:

  drivers/firmware/efi/fake_mem.c:186:20: warning: cast to pointer
  from integer of different size [-Wint-to-pointer-cast]

new_memmap_phy was defined as a u64 value and cast to void*.
This causes an int-to-pointer-cast warning in x86 32-bit
environments.

This patch changes the type of the "new_memmap_phy" variable
from "u64" to "phys_addr_t" to avoid it.

Reported-by: Ingo Molnar <mi...@kernel.org>
Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 drivers/firmware/efi/fake_mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index 32bcb14..b65bc07 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -59,7 +59,7 @@ void __init efi_fake_memmap(void)
 	u64 start, end, m_start, m_end, m_attr;
 	int new_nr_map = memmap.nr_map;
 	efi_memory_desc_t *md;
-	u64 new_memmap_phy;
+	phys_addr_t new_memmap_phy;
 	void *new_memmap;
 	void *old, *new;
 	int i;
--
1.8.3.1
[PATCH v2] efi: Fix warning of int-to-pointer-cast on x86 32-bit builds
commit-0f96a99 introduces the following warning message:

  drivers/firmware/efi/fake_mem.c:186:20: warning: cast to pointer
  from integer of different size [-Wint-to-pointer-cast]

new_memmap_phy was defined as a u64 value and cast to void*.
This causes an int-to-pointer-cast warning in x86 32-bit
environments.

This patch changes the type of the "new_memmap_phy" variable
from "u64" to "ulong" to avoid it.

v1 -> v2:
 - change the type of "new_memmap_phy" from phys_addr_t to ulong
   according to Ard's comment

Reported-by: Ingo Molnar <mi...@kernel.org>
Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 drivers/firmware/efi/fake_mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index 32bcb14..1f483b4 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -59,7 +59,7 @@ void __init efi_fake_memmap(void)
 	u64 start, end, m_start, m_end, m_attr;
 	int new_nr_map = memmap.nr_map;
 	efi_memory_desc_t *md;
-	u64 new_memmap_phy;
+	ulong new_memmap_phy;
 	void *new_memmap;
 	void *old, *new;
 	int i;
--
1.8.3.1
RE: [PATCH] mm: Introduce kernelcore=reliable option
Hi Xishi,

> On 2015/10/15 21:32, Taku Izumi wrote:
>
> > Xeon E7 v3 based systems supports Address Range Mirroring
> > and UEFI BIOS complied with UEFI spec 2.5 can notify which
> > ranges are reliable (mirrored) via EFI memory map.
> > Now Linux kernel utilize its information and allocates
> > boot time memory from reliable region.
> >
> > My requirement is:
> >   - allocate kernel memory from reliable region
> >   - allocate user memory from non-reliable region
> >
> > In order to meet my requirement, ZONE_MOVABLE is useful.
> > By arranging non-reliable range into ZONE_MOVABLE,
> > reliable memory is only used for kernel allocations.
> >
> > This patch extends existing "kernelcore" option and
> > introduces kernelcore=reliable option. By specifying
> > "reliable" instead of specifying the amount of memory,
> > non-reliable region will be arranged into ZONE_MOVABLE.
> >
> > Earlier discussion is at:
> >  https://lkml.org/lkml/2015/10/9/24
> >
>
> Hi Taku,
>
> If user don't want to waste a lot of memory, and he only set
> a few memory to mirrored memory, then the kernelcore is very
> small, right? That means OS will have a very small normal zone
> and a very large movable zone.

Right.

> Kernel allocation could only use the unmovable zone. As the
> normal zone is very small, the kernel allocation maybe OOM,
> right?

Right.

> Do you mean that we will reuse the movable zone in short-term
> solution and create a new zone(mirrored zone) in future?

If there is that kind of requirement, I don't oppose
creating a new zone.

Sincerely,
Taku Izumi
[PATCH] mm: Introduce kernelcore=reliable option
Xeon E7 v3 based systems support Address Range Mirroring, and a UEFI
BIOS that complies with UEFI spec 2.5 can report which ranges are
reliable (mirrored) via the EFI memory map. The Linux kernel now uses
this information and allocates boot-time memory from the reliable
region.

My requirement is:
  - allocate kernel memory from reliable region
  - allocate user memory from non-reliable region

In order to meet my requirement, ZONE_MOVABLE is useful.
By arranging the non-reliable range into ZONE_MOVABLE,
reliable memory is only used for kernel allocations.

This patch extends the existing "kernelcore" option and
introduces the kernelcore=reliable option. By specifying
"reliable" instead of specifying the amount of memory,
the non-reliable region will be arranged into ZONE_MOVABLE.

Earlier discussion is at:
 https://lkml.org/lkml/2015/10/9/24

For example, suppose a 2-node system with the following memory
ranges:
  node 0 [mem 0x0000000000001000-0x000000109fffffff]
  node 1 [mem 0x00000010a0000000-0x000000209fffffff]

and the following ranges are marked as reliable (*):
  [0x0000000000000000-0x0000000100000000]
  [0x0000000100000000-0x0000000180000000]
  [0x00000010a0000000-0x0000001120000000]

If you specify kernelcore=reliable, Movable zones are arranged
like the following:
  Movable zone start for each node
    Node 0: 0x0000000180000000
    Node 1: 0x0000001120000000

(*) I specified the following instead of using a UEFI BIOS that
    complies with UEFI spec 2.5:
     efi_fake_mem=4G@0:0x10000,2G@0x10a0000000:0x10000,2G@4G:0x10000

    efi_fake_mem is found at:
     git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git tags/efi-next

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |  9 ++++++++-
 mm/page_alloc.c                     | 26 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index cd5312f..b2c8c13 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1663,7 +1663,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 
 	keepinitrd	[HW,ARM]
 
-	kernelcore=nn[KMG]	[KNL,X86,IA-64,PPC] This parameter
+	kernelcore=	Format: nn[KMG] | "reliable"
+			[KNL,X86,IA-64,PPC] This parameter
 			specifies the amount of memory usable by the kernel
 			for non-movable allocations.  The requested amount is
 			spread evenly throughout all nodes in the system. The
@@ -1679,6 +1680,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			use the HighMem zone if it exists, and the Normal
 			zone if it does not.
 
+			Instead of specifying the amount of memory (nn[KMS]),
+			you can specify "reliable" option. In case "reliable"
+			option is specified, reliable memory is used for
+			non-movable allocations and remaining memory is used
+			for Movable pages.
+
 	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
 			Format: <Controller#>[,poll interval]
 			The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index beda417..d0b3ac9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -221,6 +221,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool reliable_kernelcore __initdata;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -5618,6 +5619,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 	}
 
 	/*
+	 * If kernelcore=reliable is specified, ignore movablecore option
+	 */
+	if (reliable_kernelcore) {
+		for_each_memblock(memory, r) {
+			if (memblock_is_mirror(r))
+				continue;
+
+			nid = r->nid;
+
+			usable_startpfn = PFN_DOWN(r->base);
+			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+				min(usable_startpfn, zone_movable_pfn[nid]) :
+				usable_startpfn;
+		}
+
+		goto out2;
+	}
+
+	/*
 	 * If movablecore=nn[KMG] was specified, calculate what size of
 	 * kernelcore that corresponds so that memory usable for
 	 * any allocation type is evenly spread. If both kernelcore
@@ -5873,6 +5893,12 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
  */
 stati
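The per-node rule in the hunk above (the movable zone starts at the lowest non-mirrored region of each node) can be tried outside the kernel. The following is only a userspace sketch under stated assumptions: `struct region`, `find_movable_start()`, `MAX_NODES` and the 4 KiB page size are illustrative names and values, not kernel definitions, and the real code walks memblock regions rather than a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define MAX_NODES  2	/* illustrative, sized for a 2-node example */

/* Hypothetical stand-in for a memblock region; not the kernel's struct. */
struct region {
	uint64_t base;		/* physical start address */
	uint64_t size;		/* length in bytes */
	int nid;		/* NUMA node id */
	bool mirrored;		/* true if marked reliable (mirrored) */
};

/*
 * For each node, record the lowest start PFN among its non-mirrored
 * regions. An entry of 0 means "nothing found for this node", which is
 * how the patch treats an unset zone_movable_pfn[] slot.
 */
void find_movable_start(const struct region *r, size_t n,
			uint64_t movable_pfn[MAX_NODES])
{
	for (int i = 0; i < MAX_NODES; i++)
		movable_pfn[i] = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t start_pfn;

		if (r[i].mirrored)
			continue;	/* reliable memory stays out of ZONE_MOVABLE */
		assert(r[i].nid >= 0 && r[i].nid < MAX_NODES);

		start_pfn = r[i].base >> PAGE_SHIFT;	/* same effect as PFN_DOWN() */
		if (movable_pfn[r[i].nid] == 0 ||
		    start_pfn < movable_pfn[r[i].nid])
			movable_pfn[r[i].nid] = start_pfn;
	}
}
```

Note that the rule only looks at region start addresses, so a mirrored range that sits above the first non-mirrored range of a node still ends up inside the movable zone.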
RE: [PATCH][RFC] mm: Introduce kernelcore=reliable option
> > I remember Kame has already suggested this idea. In my opinion,
> > I still think it's better to add a new migratetype or a new zone,
> > so both user and kernel could use mirrored memory.
>
> A new zone would be more flexible ... and probably the right long
> term solution. But this looks like a very clever way to try out the
> feature with a minimally invasive patch.

Yes. I agree that creating a new zone is the right solution for the long term.
I believe this approach using ZONE_MOVABLE is a good and reasonable
short-term solution.

Sincerely,
Taku Izumi
[PATCH][RFC] mm: Introduce kernelcore=reliable option
Xeon E7 v3 based systems support Address Range Mirroring, and a UEFI
BIOS that complies with UEFI spec 2.5 can report which ranges are
reliable (mirrored) via the EFI memory map. The Linux kernel now uses
this information and allocates boot-time memory from the reliable
region.

My requirement is:
  - allocate kernel memory from reliable region
  - allocate user memory from non-reliable region

In order to meet my requirement, ZONE_MOVABLE is useful.
By arranging the non-reliable range into ZONE_MOVABLE,
reliable memory is only used for kernel allocations.

This patch extends the existing "kernelcore" option and
introduces the kernelcore=reliable option. By specifying
"reliable" instead of specifying the amount of memory,
the non-reliable region will be arranged into ZONE_MOVABLE.

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |  9 ++++++++-
 mm/page_alloc.c                     | 26 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 50fc09b..6791cbb 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1669,7 +1669,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 
 	keepinitrd	[HW,ARM]
 
-	kernelcore=nn[KMG]	[KNL,X86,IA-64,PPC] This parameter
+	kernelcore=	Format: nn[KMG] | "reliable"
+			[KNL,X86,IA-64,PPC] This parameter
 			specifies the amount of memory usable by the kernel
 			for non-movable allocations.  The requested amount is
 			spread evenly throughout all nodes in the system. The
@@ -1685,6 +1686,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			use the HighMem zone if it exists, and the Normal
 			zone if it does not.
 
+			Instead of specifying the amount of memory (nn[KMS]),
+			you can specify "reliable" option. In case "reliable"
+			option is specified, reliable memory is used for
+			non-movable allocations and remaining memory is used
+			for Movable pages.
+
 	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
 			Format: <Controller#>[,poll interval]
 			The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48aaf7b..91d7556 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -242,6 +242,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool reliable_kernelcore __initdata;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -5652,6 +5653,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 	}
 
 	/*
+	 * If kernelcore=reliable is specified, ignore movablecore option
+	 */
+	if (reliable_kernelcore) {
+		for_each_memblock(memory, r) {
+			if (memblock_is_mirror(r))
+				continue;
+
+			nid = r->nid;
+
+			usable_startpfn = PFN_DOWN(r->base);
+			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+				min(usable_startpfn, zone_movable_pfn[nid]) :
+				usable_startpfn;
+		}
+
+		goto out2;
+	}
+
+	/*
 	 * If movablecore=nn[KMG] was specified, calculate what size of
 	 * kernelcore that corresponds so that memory usable for
 	 * any allocation type is evenly spread. If both kernelcore
@@ -5907,6 +5927,12 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
  */
 static int __init cmdline_parse_kernelcore(char *p)
 {
+	/* parse kernelcore=reliable */
+	if (parse_option_str(p, "reliable")) {
+		reliable_kernelcore = true;
+		return 0;
+	}
+
 	return cmdline_parse_core(p, &required_kernelcore);
 }
 
-- 
1.8.3.1
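The last hunk makes kernelcore= accept either a size or the keyword "reliable". A rough userspace sketch of that two-way parse follows. It is an approximation, not the kernel code: `struct kernelcore_opt` and `parse_kernelcore()` are hypothetical names, the suffix handling only imitates the K/M/G part of the kernel's memparse(), and a plain strcmp() stands in for parse_option_str().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
struct kernelcore_opt {
	bool reliable;		/* kernelcore=reliable was given */
	uint64_t bytes;		/* otherwise: requested size in bytes */
};

/* Returns 0 on success, -1 if the value is neither "reliable" nor nn[KMG]. */
int parse_kernelcore(const char *arg, struct kernelcore_opt *out)
{
	char *end;
	unsigned long long v;

	out->reliable = false;
	out->bytes = 0;

	/* The keyword is checked first, as in the patch. */
	if (strcmp(arg, "reliable") == 0) {
		out->reliable = true;
		return 0;
	}

	v = strtoull(arg, &end, 0);
	if (end == arg)
		return -1;	/* no digits at all */

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		end++;
		break;
	default:
		break;
	}
	if (*end != '\0')
		return -1;	/* trailing garbage */

	out->bytes = v;
	return 0;
}
```

Checking the keyword before the numeric parse keeps the existing nn[KMG] behaviour unchanged for every value that is not literally "reliable".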
[tip:perf/core] perf/x86/intel/uncore: Fix multi-segment problem of perf_event_intel_uncore
Commit-ID:  712df65ccb63da08a484bf57c40b250dfd4103a7
Gitweb:     http://git.kernel.org/tip/712df65ccb63da08a484bf57c40b250dfd4103a7
Author:     Taku Izumi <izumi.t...@jp.fujitsu.com>
AuthorDate: Thu, 24 Sep 2015 21:10:21 +0900
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Tue, 6 Oct 2015 17:31:51 +0200

perf/x86/intel/uncore: Fix multi-segment problem of perf_event_intel_uncore

In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0:

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case, the relation between bus number and physical id may be broken
because "uncore_pcibus_to_physid" does not take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing the segment-aware pci2phy_map instead.

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: a...@kernel.org
Cc: h...@zytor.com
Link: http://lkml.kernel.org/r/1443096621-4119-1-git-send-email-izumi.t...@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c      | 61 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h      | 12 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c  | 16 --
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 32 +---
 4 files changed, 106 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 560e525..61215a6 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 
 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,59 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
 	EVENT_CONSTRAINT(0, 0, 0);
 
+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+	struct pci2phy_map *map;
+	int phys_id = -1;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(bus)) {
+			phys_id = map->pbus_to_physid[bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
+
+	return phys_id;
+}
+
+struct pci2phy_map *__find_pci2phy_map(int segment)
+{
+	struct pci2phy_map *map, *alloc = NULL;
+	int i;
+
+	lockdep_assert_held(&pci2phy_map_lock);
+
+lookup:
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == segment)
+			goto end;
+	}
+
+	if (!alloc) {
+		raw_spin_unlock(&pci2phy_map_lock);
+		alloc = kmalloc(sizeof(struct pci2phy_map), GFP_KERNEL);
+		raw_spin_lock(&pci2phy_map_lock);
+
+		if (!alloc)
+			return NULL;
+
+		goto lookup;
+	}
+
+	map = alloc;
+	alloc = NULL;
+	map->segment = segment;
+	for (i = 0; i < 256; i++)
+		map->pbus_to_physid[i] = -1;
+	list_add_tail(&map->list, &pci2phy_map_head);
+
+end:
+	kfree(alloc);
+	return map;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +863,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
 	int phys_id;
 	bool first_box = false;
 
-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	if (phys_id < 0)
 		return -ENODEV;
 
@@ -856,9 +910,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
 	struct intel_uncore_box *box = pci_get_drvdata(pdev);
 	struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id;
 	bool last_box = false;
 
+	phys_id = uncore_pcib
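To see why keying on the bus number alone is not enough, the idea of one 256-entry table per PCI segment can be modelled in userspace. This is a simplified sketch, not the kernel implementation: `struct seg_map`, `seg_map_get()` and `bus_to_physid()` are hypothetical names, a hand-rolled singly linked list replaces `struct list_head`, and the raw spinlock (including the unlock/allocate/relock dance in __find_pci2phy_map()) is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>

#define BUSES_PER_SEGMENT 256

/* Hypothetical userspace analogue of the patch's struct pci2phy_map. */
struct seg_map {
	struct seg_map *next;
	int segment;				/* PCI segment (domain) number */
	int bus_to_physid[BUSES_PER_SEGMENT];	/* -1 means "unknown" */
};

struct seg_map *seg_map_head;	/* one list node per segment seen so far */

/* Find the map for a segment, creating it (all entries -1) on first use. */
struct seg_map *seg_map_get(int segment)
{
	struct seg_map *m;

	for (m = seg_map_head; m; m = m->next)
		if (m->segment == segment)
			return m;

	m = malloc(sizeof(*m));
	if (!m)
		return NULL;
	m->segment = segment;
	for (int i = 0; i < BUSES_PER_SEGMENT; i++)
		m->bus_to_physid[i] = -1;
	m->next = seg_map_head;
	seg_map_head = m;
	return m;
}

/* Look up the physical id for (segment, bus); -1 if nothing is recorded. */
int bus_to_physid(int segment, unsigned char bus)
{
	for (struct seg_map *m = seg_map_head; m; m = m->next)
		if (m->segment == segment)
			return m->bus_to_physid[bus];
	return -1;
}
```

With this layout, bus 0xff on segment 0 and bus 0xff on segment 1 occupy different slots, which is the collision the commit message describes for the old flat array.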
[PATCH 2/2] x86, efi: Add "efi_fake_mem" boot option
This patch introduces a new boot option named "efi_fake_mem".
By specifying this parameter, you can add an arbitrary attribute
to a specific memory range.
This is useful for debugging the Address Range Mirroring feature.

For example, if "efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000"
is specified, the original (firmware provided) EFI memmap will be
updated so that the specified memory regions have the
EFI_MEMORY_MORE_RELIABLE attribute (0x10000):

  efi: mem36: [Conventional Memory|  |  |  |  |  |  |WB|WT|WC|UC] range=[0x0000000100000000-0x00000020a0000000) (129536MB)

  efi: mem36: [Conventional Memory|  |MR|  |  |  |  |WB|WT|WC|UC] range=[0x0000000100000000-0x0000000180000000) (2048MB)
  efi: mem37: [Conventional Memory|  |  |  |  |  |  |WB|WT|WC|UC] range=[0x0000000180000000-0x00000010a0000000) (61952MB)
  efi: mem38: [Conventional Memory|  |MR|  |  |  |  |WB|WT|WC|UC] range=[0x00000010a0000000-0x0000001120000000) (2048MB)
  efi: mem39: [Conventional Memory|  |  |  |  |  |  |WB|WT|WC|UC] range=[0x0000001120000000-0x00000020a0000000) (63488MB)

And you will find that the following message is output:

  efi: Memory: 4096M/131455M mirrored memory

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |  15 +++
 arch/x86/kernel/setup.c             |   4 +-
 drivers/firmware/efi/Kconfig        |  22
 drivers/firmware/efi/Makefile       |   1 +
 drivers/firmware/efi/fake_mem.c     | 238
 include/linux/efi.h                 |   6 +
 6 files changed, 285 insertions(+), 1 deletion(-)
 create mode 100644 drivers/firmware/efi/fake_mem.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 22a4b68..50fc09b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1094,6 +1094,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			you are really sure that your UEFI does sane gc and
 			fulfills the spec otherwise your board may brick.
 
+	efi_fake_mem=	nn[KMG]@ss[KMG]:aa[,nn[KMG]@ss[KMG]:aa,..] [EFI; X86]
+			Add arbitrary attribute to specific memory range by
+			updating original EFI memory map.
+			Region of memory which aa attribute is added to is
+			from ss to ss+nn.
+			If efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000
+			is specified, EFI_MEMORY_MORE_RELIABLE(0x10000)
+			attribute is added to range 0x100000000-0x180000000 and
+			0x10a0000000-0x1120000000.
+
+			Using this parameter you can do debugging of EFI memmap
+			related feature. For example, you can do debugging of
+			Address Range Mirroring feature even if your box
+			doesn't support it.
+
 	eisa_irq_edge=	[PARISC,HW]
 			See header of drivers/parisc/eisa.c.
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fdb7f2a..30b4c44 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1079,8 +1079,10 @@ void __init setup_arch(char **cmdline_p)
 	memblock_set_current_limit(ISA_END_ADDRESS);
 	memblock_x86_fill();
 
-	if (efi_enabled(EFI_BOOT))
+	if (efi_enabled(EFI_BOOT)) {
+		efi_fake_memmap();
 		efi_find_mirror();
+	}
 
 	/*
 	 * The EFI specification says that boot service code won't be called
diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 84533e0..ac47cc4d 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -52,6 +52,28 @@ config EFI_RUNTIME_MAP
 
 	  See also Documentation/ABI/testing/sysfs-firmware-efi-runtime-map.
 
+config EFI_FAKE_MEMMAP
+	bool "Enable EFI fake memory map"
+	depends on EFI && X86
+	default n
+	help
+	  Saying Y here will enable "efi_fake_mem" boot option.
+	  By specifying this parameter, you can add arbitrary attribute
+	  to specific memory range by updating original (firmware provided)
+	  EFI memmap.
+	  This is useful for debugging of EFI memmap related feature.
+	  e.g. Address Range Mirroring feature.
+
+config EFI_MAX_FAKE_MEM
+	int "maximum allowable number of ranges in efi_fake_mem boot option"
+	depends on EFI && X86 && EFI_FAKE_MEMMAP
+	range 1 128
+	default 8
+	help
+	  Maximum allowable number of ranges in efi_fake_mem boot option.
+	  Ranges can be set up to this value using comma-separated list.
+	  The default value is 8.
+
 config EFI_PARAMS_FROM_FDT
 	bool
 	help
diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
index 6fd3da9..c24
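The documented syntax, nn[KMG]@ss[KMG]:aa with comma-separated repeats, is small enough to try in userspace. The sketch below is one possible reading of that format and not the code in fake_mem.c (which is not shown here): `struct fake_range`, `parse_size()` and `parse_fake_mem()` are hypothetical names, and the suffix handling only approximates the kernel's memparse(). EFI_MEMORY_MORE_RELIABLE is bit 16, so the attribute value for a mirrored range is 0x10000.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical parsed form of one "nn@ss:aa" entry. */
struct fake_range {
	uint64_t start;		/* ss */
	uint64_t end;		/* ss + nn - 1, inclusive */
	uint64_t attribute;	/* aa, e.g. 0x10000 for EFI_MEMORY_MORE_RELIABLE */
};

/* Number with optional K/M/G suffix; *endp is set past what was consumed. */
static uint64_t parse_size(const char *s, const char **endp)
{
	char *end;
	uint64_t v = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		end++;
		break;
	default:
		break;
	}
	*endp = end;
	return v;
}

/* Returns the number of ranges stored in out[], or -1 on a syntax error. */
int parse_fake_mem(const char *p, struct fake_range *out, int max)
{
	int n = 0;

	while (*p) {
		const char *end;
		char *aend;
		uint64_t size, start, attr;

		size = parse_size(p, &end);
		if (end == p || *end != '@' || size == 0)
			return -1;
		p = end + 1;

		start = parse_size(p, &end);
		if (end == p || *end != ':')
			return -1;
		p = end + 1;

		attr = strtoull(p, &aend, 0);	/* attribute has no size suffix */
		if (aend == p || n == max)
			return -1;

		out[n].start = start;
		out[n].end = start + size - 1;
		out[n].attribute = attr;
		n++;

		p = aend;
		if (*p == ',')
			p++;
		else if (*p != '\0')
			return -1;
	}
	return n;
}
```

The `max` argument plays the role that the EFI_MAX_FAKE_MEM Kconfig value plays in the patch: entries beyond the configured limit are rejected rather than silently dropped, which is a choice made for this sketch only.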
RE: [PATCH 2/2] x86, efi: Add "efi_fake_mem" boot option
I forgot to rerun git-format-patch after rebasing.
I'll resend the right one.

> -----Original Message-----
> From: kbuild test robot [mailto:l...@intel.com]
> Sent: Wednesday, September 30, 2015 10:37 AM
> To: Izumi, Taku/泉 拓
> Cc: kbuild-...@01.org; linux-kernel@vger.kernel.org; linux-...@vger.kernel.org; x...@kernel.org; matt.flem...@intel.com;
> t...@linutronix.de; mi...@redhat.com; h...@zytor.com; tony.l...@intel.com; qinxi...@huawei.com; Kamezawa, Hiroyuki/亀澤 寛之;
> ard.biesheu...@linaro.org; Izumi, Taku/泉 拓
> Subject: Re: [PATCH 2/2] x86, efi: Add "efi_fake_mem" boot option
>
> Hi Taku,
>
> [auto build test results on v4.3-rc3 -- if it's inappropriate base, please ignore]
>
> config: i386-allmodconfig (attached as .config)
> reproduce:
>   git checkout afcc94d3f91a00ce97d735a563a8e5d595f45a03
>   # save the attached .config to linux build tree
>   make ARCH=i386
>
> All error/warnings (new ones prefixed by >>):
>
> >> drivers/firmware/efi/fake_mem.c:36:25: error: 'CONFIG_EFI_MAX_FAKEMEM' undeclared here (not in a function)
>     #define EFI_MAX_FAKEMEM CONFIG_EFI_MAX_FAKEMEM
>                             ^
> >> drivers/firmware/efi/fake_mem.c:42:34: note: in expansion of macro 'EFI_MAX_FAKEMEM'
>     static struct fake_mem fake_mems[EFI_MAX_FAKEMEM];
>                                      ^
>    drivers/firmware/efi/fake_mem.c: In function 'efi_fake_memmap':
> >> drivers/firmware/efi/fake_mem.c:186:20: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
>      memmap.phys_map = (void *)new_memmap_phy;
>                        ^
>    drivers/firmware/efi/fake_mem.c: At top level:
> >> drivers/firmware/efi/fake_mem.c:42:24: warning: 'fake_mems' defined but not used [-Wunused-variable]
>     static struct fake_mem fake_mems[EFI_MAX_FAKEMEM];
>                            ^
>
> vim +/CONFIG_EFI_MAX_FAKEMEM +36 drivers/firmware/efi/fake_mem.c
>
>     30	#include
>     31	#include
>     32	#include
>     33	#include
>     34	#include
>     35
>   > 36	#define EFI_MAX_FAKEMEM CONFIG_EFI_MAX_FAKEMEM
>     37
>     38	struct fake_mem {
>     39		struct range range;
>     40		u64 attribute;
>     41	};
>   > 42	static struct fake_mem fake_mems[EFI_MAX_FAKEMEM];
>     43	static int nr_fake_mem;
>     44
>     45	static int __init cmp_fake_mem(const void *x1, const void *x2)
>     46	{
>     47		const struct fake_mem *m1 = x1;
>     48		const struct fake_mem *m2 = x2;
>     49
>     50		if (m1->range.start < m2->range.start)
>     51			return -1;
>     52		if (m1->range.start > m2->range.start)
>     53			return 1;
>     54		return 0;
>     55	}
>     56
>     57	void __init efi_fake_memmap(void)
>     58	{
>     59		u64 start, end, m_start, m_end, m_attr;
>     60		int new_nr_map = memmap.nr_map;
>     61		efi_memory_desc_t *md;
>     62		u64 new_memmap_phy;
>     63		void *new_memmap;
>     64		void *old, *new;
>     65		int i;
>     66
>     67		if (!nr_fake_mem || !efi_enabled(EFI_MEMMAP))
>     68			return;
>     69
>     70		/* count up the number of EFI memory descriptor */
>     71		for (old = memmap.map; old < memmap.map_end; old += memmap.desc_size) {
>     72			md = old;
>     73			start = md->phys_addr;
>     74			end = start + (md->num_pages << EFI_PAGE_SHIFT) - 1;
>     75
>     76			for (i = 0; i < nr_fake_mem; i++) {
>     77				/* modifying range */
>     78				m_start = fake_mems[i].range.start;
>     79				m_end = fake_mems[i].range.end;
>     80
>     81				if (m_start <= start) {
>     82					/* split into 2 parts */
>     83					if (start < m_end && m_end < end)
>     84						new_nr_map++;
>     85				}
>     86				if (start < m_start && m_start < end) {
>     87	/
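The counting loop quoted above decides how many extra EFI descriptors are needed when a fake range cuts an existing descriptor. A userspace sketch of that decision follows. The first branch follows the quoted lines 81 to 85; the listing is cut off inside the second branch, so everything after the marked comment is my own completion of the obvious cases and should be read as an assumption. `extra_descriptors()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * How many additional descriptors are needed when the fake range
 * [m_start, m_end] is applied to the descriptor [start, end]?
 * All bounds are inclusive, as in the quoted listing.
 */
int extra_descriptors(uint64_t start, uint64_t end,
		      uint64_t m_start, uint64_t m_end)
{
	assert(start <= end && m_start <= m_end);

	if (m_start <= start) {
		/* Range covers only the head: split into 2 parts. */
		if (start < m_end && m_end < end)
			return 1;
		return 0;	/* no overlap, or the whole descriptor is covered */
	}

	/* Assumed completion: the quoted listing is truncated in this branch. */
	if (m_start < end) {
		if (m_end < end)
			return 2;	/* untouched head, range, untouched tail */
		return 1;		/* untouched head, then range to the end */
	}

	return 0;	/* range starts at or beyond the end of the descriptor */
}
```

Applied to a single large descriptor and two 2G ranges, one at its start and one in its middle, this yields 1 + 2 extra entries, so one descriptor becomes four. That is consistent with the mem36 to mem39 lines shown in the efi_fake_mem patch description.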
[PATCH 2/2] x86, efi: Add "efi_fake_mem" boot option
This patch introduces new boot option named "efi_fake_mem".
By specifying this parameter, you can add arbitrary attribute to
specific memory range. This is useful for debugging of Address Range
Mirroring feature.

For example, if "efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000"
is specified, the original (firmware provided) EFI memmap will be
updated so that the specified memory regions have
EFI_MEMORY_MORE_RELIABLE attribute (0x10000):

 <original>
   efi: mem36: [Conventional Memory|   |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x00000020a0000000) (129536MB)

 <updated>
   efi: mem36: [Conventional Memory|   |MR|  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x0000000180000000) (2048MB)
   efi: mem37: [Conventional Memory|   |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000180000000-0x00000010a0000000) (61952MB)
   efi: mem38: [Conventional Memory|   |MR|  |  |  |   |WB|WT|WC|UC] range=[0x00000010a0000000-0x0000001120000000) (2048MB)
   efi: mem39: [Conventional Memory|   |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000001120000000-0x00000020a0000000) (63488MB)

And you will find that the following message is output:

   efi: Memory: 4096M/131455M mirrored memory

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |  15 +++
 arch/x86/kernel/setup.c             |   4 +-
 drivers/firmware/efi/Kconfig        |  22
 drivers/firmware/efi/Makefile       |   1 +
 drivers/firmware/efi/fake_mem.c     | 238
 include/linux/efi.h                 |   6 +
 6 files changed, 285 insertions(+), 1 deletion(-)
 create mode 100644 drivers/firmware/efi/fake_mem.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 22a4b68..50fc09b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1094,6 +1094,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			you are really sure that your UEFI does sane gc and
 			fulfills the spec otherwise your board may brick.

+	efi_fake_mem=	nn[KMG]@ss[KMG]:aa[,nn[KMG]@ss[KMG]:aa,..] [EFI; X86]
+			Add arbitrary attribute to specific memory range by
+			updating original EFI memory map.
+			Region of memory which aa attribute is added to is
+			from ss to ss+nn.
+			If efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000
+			is specified, EFI_MEMORY_MORE_RELIABLE(0x10000)
+			attribute is added to range 0x100000000-0x180000000 and
+			0x10a0000000-0x1120000000.
+
+			Using this parameter you can do debugging of EFI memmap
+			related feature. For example, you can do debugging of
+			Address Range Mirroring feature even if your box
+			doesn't support it.
+
 	eisa_irq_edge=	[PARISC,HW]
 			See header of drivers/parisc/eisa.c.

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fdb7f2a..30b4c44 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1079,8 +1079,10 @@ void __init setup_arch(char **cmdline_p)
 	memblock_set_current_limit(ISA_END_ADDRESS);
 	memblock_x86_fill();

-	if (efi_enabled(EFI_BOOT))
+	if (efi_enabled(EFI_BOOT)) {
+		efi_fake_memmap();
 		efi_find_mirror();
+	}

 	/*
 	 * The EFI specification says that boot service code won't be called
diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 84533e0..ac47cc4d 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -52,6 +52,28 @@ config EFI_RUNTIME_MAP

 	  See also Documentation/ABI/testing/sysfs-firmware-efi-runtime-map.

+config EFI_FAKE_MEMMAP
+	bool "Enable EFI fake memory map"
+	depends on EFI && X86
+	default n
+	help
+	  Saying Y here will enable "efi_fake_mem" boot option.
+	  By specifying this parameter, you can add arbitrary attribute
+	  to specific memory range by updating original (firmware provided)
+	  EFI memmap.
+	  This is useful for debugging of EFI memmap related feature.
+	  e.g. Address Range Mirroring feature.
+
+config EFI_MAX_FAKE_MEM
+	int "maximum allowable number of ranges in efi_fake_mem boot option"
+	depends on EFI && X86 && EFI_FAKE_MEMMAP
+	range 1 128
+	default 8
+	help
+	  Maximum allowable number of ranges in efi_fake_mem boot option.
+	  Ranges can be set up to this value using comma-separated list.
+	  The default value is 8.
+
 config EFI_PARAMS_FROM_FDT
 	bool
 	help
diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
index 6fd3da9..c24
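For readers who want to experiment with the option syntax outside the kernel, the following is a minimal userspace sketch of how a string in the nn[KMG]@ss[KMG]:aa[,...] form could be split into ranges. It is not the code from fake_mem.c, which is not shown in full here. struct fake_range, parse_size() and parse_fake_mem() are hypothetical names, and the suffix handling only imitates what the kernel's memparse() helper does for K, M and G.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace mirror of one "nn@ss:aa" entry. */
struct fake_range {
	uint64_t start;     /* ss */
	uint64_t end;       /* ss + nn, exclusive */
	uint64_t attribute; /* aa, e.g. the EFI_MEMORY_MORE_RELIABLE bit */
};

/* Parse a number with an optional K/M/G suffix; *end is set past it. */
static uint64_t parse_size(const char *s, char **end)
{
	uint64_t v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/*
 * Split "nn[KMG]@ss[KMG]:aa[,nn[KMG]@ss[KMG]:aa,..]" into out[].
 * Returns the number of entries stored, or -1 on malformed input.
 */
static int parse_fake_mem(const char *p, struct fake_range *out, int max)
{
	char *end;
	int n = 0;

	assert(p != NULL && out != NULL);

	while (*p && n < max) {
		uint64_t size, start, attr;

		size = parse_size(p, &end);
		if (end == p || *end != '@')
			return -1;
		p = end + 1;

		start = parse_size(p, &end);
		if (end == p || *end != ':')
			return -1;
		p = end + 1;

		attr = strtoull(p, &end, 0);
		if (end == p)
			return -1;

		out[n].start = start;
		out[n].end = start + size;
		out[n].attribute = attr;
		n++;

		if (*end == ',')
			p = end + 1;
		else if (*end == '\0')
			p = end;
		else
			return -1;
	}
	return n;
}
```

Passing a single entry of 2G@4G with some attribute value should give one range that starts at 4 GiB and ends 2 GiB later, which matches the "from ss to ss+nn" wording in the documentation hunk.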
[PATCH 0/2] Introduce "efi_fake_mem" boot option
UEFI spec 2.5 introduces a new Memory Attribute Definition named
EFI_MEMORY_MORE_RELIABLE, which indicates which memory ranges are
mirrored. The Linux kernel can now recognize which memory ranges are
mirrored by handling the EFI_MEMORY_MORE_RELIABLE attribute.
However, testing this feature necessitates boxes with UEFI spec 2.5
compliant firmware.

This patchset introduces a new boot option named "efi_fake_mem".
By specifying this parameter, you can add an arbitrary attribute to a
specific memory range. This is useful for debugging the Address Range
Mirroring feature.

This is an updated version of the former patch posted at
http://www.mail-archive.com/linux-efi@vger.kernel.org/msg05936.html

changelog:
 - change boot option name and spec
     efi_fake_mem_mirror=nn@ss -> efi_fake_mem=nn@ss:aa
 - rename print_efi_memmap() to efi_print_memmap()
 - introduce new config named CONFIG_EFI_MAX_FAKE_MEM
 - and some fixes pointed out by Matt Fleming

Taku Izumi (2):
  x86, efi: rename print_efi_memmap() to efi_print_memmap()
  x86, efi: Add "efi_fake_mem" boot option

 Documentation/kernel-parameters.txt |  15 +++
 arch/x86/include/asm/efi.h          |   1 +
 arch/x86/kernel/setup.c             |   4 +-
 arch/x86/platform/efi/efi.c         |   4 +-
 drivers/firmware/efi/Kconfig        |  22
 drivers/firmware/efi/Makefile       |   1 +
 drivers/firmware/efi/fake_mem.c     | 238
 include/linux/efi.h                 |   6 +
 8 files changed, 288 insertions(+), 3 deletions(-)
 create mode 100644 drivers/firmware/efi/fake_mem.c

--
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] x86, efi: rename print_efi_memmap() to efi_print_memmap()
This patch renames print_efi_memmap() to efi_print_memmap() and
makes it a global function so that we can invoke it outside of
arch/x86/platform/efi/efi.c.

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 arch/x86/include/asm/efi.h  | 1 +
 arch/x86/platform/efi/efi.c | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index ab5f1d4..f8b93d6 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -103,6 +103,7 @@ extern void __init efi_set_executable(efi_memory_desc_t *md, bool executable);
 extern int __init efi_memblock_x86_reserve_range(void);
 extern pgd_t * __init efi_call_phys_prolog(void);
 extern void __init efi_call_phys_epilog(pgd_t *save_pgd);
+extern void __init efi_print_memmap(void);
 extern void __init efi_unmap_memmap(void);
 extern void __init efi_memory_uc(u64 addr, unsigned long size);
 extern void __init efi_map_region(efi_memory_desc_t *md);
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 1db84c0..1f95caf 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -222,7 +222,7 @@ int __init efi_memblock_x86_reserve_range(void)
 	return 0;
 }

-static void __init print_efi_memmap(void)
+void __init efi_print_memmap(void)
 {
 #ifdef EFI_DEBUG
 	efi_memory_desc_t *md;
@@ -524,7 +524,7 @@ void __init efi_init(void)
 		return;

 	if (efi_enabled(EFI_DBG))
-		print_efi_memmap();
+		efi_print_memmap();

 	efi_esrt_init();
 }
--
1.8.3.1
RE: [PATCH 2/2] x86, efi: Add "efi_fake_mem" boot option
I missed running git-format-patch after rebasing.
I'll resend the right one.

> -----Original Message-----
> From: kbuild test robot [mailto:l...@intel.com]
> Sent: Wednesday, September 30, 2015 10:37 AM
> To: Izumi, Taku/泉 拓
> Cc: kbuild-...@01.org; linux-kernel@vger.kernel.org; linux-...@vger.kernel.org; x...@kernel.org; matt.flem...@intel.com; t...@linutronix.de; mi...@redhat.com; h...@zytor.com; tony.l...@intel.com; qinxi...@huawei.com; Kamezawa, Hiroyuki/亀澤 寛之; ard.biesheu...@linaro.org; Izumi, Taku/泉 拓
> Subject: Re: [PATCH 2/2] x86, efi: Add "efi_fake_mem" boot option
>
> Hi Taku,
>
> [auto build test results on v4.3-rc3 -- if it's inappropriate base, please ignore]
>
> config: i386-allmodconfig (attached as .config)
> reproduce:
>   git checkout afcc94d3f91a00ce97d735a563a8e5d595f45a03
>   # save the attached .config to linux build tree
>   make ARCH=i386
>
> All error/warnings (new ones prefixed by >>):
>
> >> drivers/firmware/efi/fake_mem.c:36:25: error: 'CONFIG_EFI_MAX_FAKEMEM' undeclared here (not in a function)
>     #define EFI_MAX_FAKEMEM CONFIG_EFI_MAX_FAKEMEM
>                             ^
> >> drivers/firmware/efi/fake_mem.c:42:34: note: in expansion of macro 'EFI_MAX_FAKEMEM'
>     static struct fake_mem fake_mems[EFI_MAX_FAKEMEM];
>                                      ^
>    drivers/firmware/efi/fake_mem.c: In function 'efi_fake_memmap':
> >> drivers/firmware/efi/fake_mem.c:186:20: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
>      memmap.phys_map = (void *)new_memmap_phy;
>                        ^
>    drivers/firmware/efi/fake_mem.c: At top level:
> >> drivers/firmware/efi/fake_mem.c:42:24: warning: 'fake_mems' defined but not used [-Wunused-variable]
>     static struct fake_mem fake_mems[EFI_MAX_FAKEMEM];
>                            ^
>
> vim +/CONFIG_EFI_MAX_FAKEMEM +36 drivers/firmware/efi/fake_mem.c
>
>     30	#include
>     31	#include
>     32	#include
>     33	#include
>     34	#include
>     35
>   > 36	#define EFI_MAX_FAKEMEM CONFIG_EFI_MAX_FAKEMEM
>     37
>     38	struct fake_mem {
>     39		struct range range;
>     40		u64 attribute;
>     41	};
>   > 42	static struct fake_mem fake_mems[EFI_MAX_FAKEMEM];
>     43	static int nr_fake_mem;
>     44
>     45	static int __init cmp_fake_mem(const void *x1, const void *x2)
>     46	{
>     47		const struct fake_mem *m1 = x1;
>     48		const struct fake_mem *m2 = x2;
>     49
>     50		if (m1->range.start < m2->range.start)
>     51			return -1;
>     52		if (m1->range.start > m2->range.start)
>     53			return 1;
>     54		return 0;
>     55	}
>     56
>     57	void __init efi_fake_memmap(void)
>     58	{
>     59		u64 start, end, m_start, m_end, m_attr;
>     60		int new_nr_map = memmap.nr_map;
>     61		efi_memory_desc_t *md;
>     62		u64 new_memmap_phy;
>     63		void *new_memmap;
>     64		void *old, *new;
>     65		int i;
>     66
>     67		if (!nr_fake_mem || !efi_enabled(EFI_MEMMAP))
>     68			return;
>     69
>     70		/* count up the number of EFI memory descriptor */
>     71		for (old = memmap.map; old < memmap.map_end; old += memmap.desc_size) {
>     72			md = old;
>     73			start = md->phys_addr;
>     74			end = start + (md->num_pages << EFI_PAGE_SHIFT) - 1;
>     75
>     76			for (i = 0; i < nr_fake_mem; i++) {
>     77				/* modifying range */
>     78				m_start = fake_mems[i].range.start;
>     79				m_end = fake_mems[i].range.end;
>     80
>     81				if (m_start <= start) {
>     82					/* split into 2 parts */
>     83					if (start < m_end && m_end < end)
>     84						new_nr_map++;
>     85				}
>     86				if (start < m_start && m_start < end) {
>     87					/
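A note on the -Wint-to-pointer-cast warning in the report above: on an i386 build a u64 is wider than a pointer, so a direct (void *) cast of a 64-bit value is flagged. A common way to make such a conversion explicit is to go through an integer type of pointer width first (kernel code often writes (void *)(unsigned long)x for this). The sketch below shows the same idea in portable userspace C. u64_fits_pointer() and u64_to_ptr() are hypothetical helpers for illustration, and this is not necessarily how the resent patch deals with the warning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if a 64-bit address is representable as a pointer here. */
static int u64_fits_pointer(uint64_t addr)
{
	return addr <= UINTPTR_MAX;
}

/*
 * Convert a 64-bit address to a pointer without a size-mismatch warning.
 * The intermediate uintptr_t cast makes the narrowing explicit; the
 * assert catches values that would be truncated on a 32-bit build.
 */
static void *u64_to_ptr(uint64_t addr)
{
	assert(u64_fits_pointer(addr));
	return (void *)(uintptr_t)addr;
}
```

On a 64-bit build the range check is always true, so the helper only matters where pointers are 32 bits wide.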
[PATCH v5][RESEND] perf, x86: Fix multi-segment problem of perf_event_intel_uncore
In a multi-segment system, uncore devices may belong to buses whose
segment number is other than 0.

  ....
  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case, the relation of bus number and physical id may be broken
because "uncore_pcibus_to_physid" doesn't take account of the PCI
segment. For example, bus 0000:ff and 0001:ff use the same entry of
the "uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing a segment-aware
pci2phy_map instead.

 v4 -> v5:
   - Add initialization code of pci2phy_map when newly alloced
     at __find_pci2phy_map()

 v3 -> v4:
   - avoid GFP_ATOMIC allocation at __find_pci2phy_map()
   - Add missing pci_dev_put at snb_pci2phy_map_init()
   - Add missing raw_spin_unlock at snbep_pci2phy_map_init()

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c      | 61 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h      | 12 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c  | 16 --
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 32 +---
 4 files changed, 106 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 560e525..61215a6 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];

 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,59 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
 	EVENT_CONSTRAINT(0, 0, 0);

+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+	struct pci2phy_map *map;
+	int phys_id = -1;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(bus)) {
+			phys_id = map->pbus_to_physid[bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
+
+	return phys_id;
+}
+
+struct pci2phy_map *__find_pci2phy_map(int segment)
+{
+	struct pci2phy_map *map, *alloc = NULL;
+	int i;
+
+	lockdep_assert_held(&pci2phy_map_lock);
+
+lookup:
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == segment)
+			goto end;
+	}
+
+	if (!alloc) {
+		raw_spin_unlock(&pci2phy_map_lock);
+		alloc = kmalloc(sizeof(struct pci2phy_map), GFP_KERNEL);
+		raw_spin_lock(&pci2phy_map_lock);
+
+		if (!alloc)
+			return NULL;
+
+		goto lookup;
+	}
+
+	map = alloc;
+	alloc = NULL;
+	map->segment = segment;
+	for (i = 0; i < 256; i++)
+		map->pbus_to_physid[i] = -1;
+	list_add_tail(&map->list, &pci2phy_map_head);
+
+end:
+	kfree(alloc);
+	return map;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +863,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
 	int phys_id;
 	bool first_box = false;

-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	if (phys_id < 0)
 		return -ENODEV;

@@ -856,9 +910,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
 	struct intel_uncore_box *box = pci_get_drvdata(pdev);
 	struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id;
 	bool last_box = false;

+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	box = pci_get_drvdata(pdev);
 	if (!box) {
 		for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 72c54c2..2f0a4a9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event
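To illustrate why keying the table by segment matters, here is a small userspace sketch of the same data layout: one 256-entry bus table per PCI segment, with every slot initialised to -1. It uses a plain singly linked list and no locking, so it only models the lookup logic, not the kernel's list_head or raw spinlock handling. struct seg_map, find_or_add_map() and bus_to_physid() are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* One bus-to-socket table per PCI segment (domain). */
struct seg_map {
	struct seg_map *next;
	int segment;
	int pbus_to_physid[256];
};

static struct seg_map *map_head;

/* Return the table for a segment, creating it on first use. */
static struct seg_map *find_or_add_map(int segment)
{
	struct seg_map *map;
	int i;

	assert(segment >= 0);

	for (map = map_head; map; map = map->next)
		if (map->segment == segment)
			return map;

	map = malloc(sizeof(*map));
	if (!map)
		return NULL;

	map->segment = segment;
	/* Freshly allocated memory is not zeroed, so mark every bus unknown. */
	for (i = 0; i < 256; i++)
		map->pbus_to_physid[i] = -1;
	map->next = map_head;
	map_head = map;
	return map;
}

/* Look up the physical id for (segment, bus); -1 if nothing is recorded. */
static int bus_to_physid(int segment, unsigned char bus)
{
	struct seg_map *map;

	for (map = map_head; map; map = map->next)
		if (map->segment == segment)
			return map->pbus_to_physid[bus];
	return -1;
}
```

With this layout, bus ff in segment 0 and bus ff in segment 1 land in different tables, so recording a socket for one no longer overwrites the other.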
[PATCH v4] perf, x86: Fix multi-segment problem of perf_event_intel_uncore
In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0.

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case, the relation between bus number and physical id may be broken
because "uncore_pcibus_to_physid" doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing a segment-aware pci2phy_map instead.

v3 -> v4:
 - avoid GFP_ATOMIC allocation at __find_pci2phy_map()
 - Add missing pci_dev_put at snb_pci2phy_map_init()
 - Add missing raw_spin_unlock at snbep_pci2phy_map_init()

Signed-off-by: Taku Izumi
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c | 58 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h | 12 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c | 16 --
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 32 +---
 4 files changed, 103 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 560e525..3fba445 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];

 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,56 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
 	EVENT_CONSTRAINT(0, 0, 0);

+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+	struct pci2phy_map *map;
+	int phys_id = -1;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(bus)) {
+			phys_id = map->pbus_to_physid[bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
+
+	return phys_id;
+}
+
+struct pci2phy_map *__find_pci2phy_map(int segment)
+{
+	struct pci2phy_map *map, *alloc = NULL;
+
+	lockdep_assert_held(&pci2phy_map_lock);
+
+lookup:
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == segment)
+			goto end;
+	}
+
+	if (!alloc) {
+		raw_spin_unlock(&pci2phy_map_lock);
+		alloc = kmalloc(sizeof(struct pci2phy_map), GFP_KERNEL);
+		raw_spin_lock(&pci2phy_map_lock);
+
+		if (!alloc)
+			return NULL;
+
+		goto lookup;
+	}
+
+	map = alloc;
+	alloc = NULL;
+	map->segment = segment;
+	list_add_tail(&map->list, &pci2phy_map_head);
+
+end:
+	kfree(alloc);
+	return map;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +860,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
 	int phys_id;
 	bool first_box = false;

-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	if (phys_id < 0)
 		return -ENODEV;

@@ -856,9 +907,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
 	struct intel_uncore_box *box = pci_get_drvdata(pdev);
 	struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id;
 	bool last_box = false;

+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	box = pci_get_drvdata(pdev);
 	if (!box) {
 		for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 72c54c2..2f0a4a9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -117,6 +117,15 @@ struct uncore_event_desc {
 	const char *config;
 };

+struct pci2phy_map {
+	struct list_head list;
+	int segment;
+	int pbus_to_physid[2
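For readers who want to experiment with the idea outside the kernel, the segment-aware lookup and the "drop the lock, allocate, retake the lock, search again" pattern used by __find_pci2phy_map() in v4 can be sketched in userspace C roughly as follows. This is only an illustrative sketch under stated assumptions, not the kernel code: pthread_mutex_t stands in for the raw spinlock, malloc() for kmalloc(GFP_KERNEL), and a hand-rolled singly linked list for list_head. The names seg_map, lookup_locked(), find_seg_map(), set_physid() and bus_to_physid() are made up for this example, and initialising new entries to -1 is a choice of the sketch.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct pci2phy_map. */
struct seg_map {
	struct seg_map *next;
	int segment;
	int bus_to_physid[256];
};

static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;
static struct seg_map *map_head;

/* Walk the list; the caller must hold map_lock. */
static struct seg_map *lookup_locked(int segment)
{
	struct seg_map *m;

	for (m = map_head; m; m = m->next)
		if (m->segment == segment)
			return m;
	return NULL;
}

/*
 * Find the map for a segment, creating it if needed.
 * Called with map_lock held. The lock is dropped around malloc(),
 * so the list has to be searched again once the lock is retaken.
 */
static struct seg_map *find_seg_map(int segment)
{
	struct seg_map *m, *alloc;
	int i;

	m = lookup_locked(segment);
	if (m)
		return m;

	pthread_mutex_unlock(&map_lock);
	alloc = malloc(sizeof(*alloc));
	pthread_mutex_lock(&map_lock);
	if (!alloc)
		return NULL;

	/* Another thread may have added the segment in the meantime. */
	m = lookup_locked(segment);
	if (m) {
		free(alloc);
		return m;
	}

	alloc->segment = segment;
	for (i = 0; i < 256; i++)
		alloc->bus_to_physid[i] = -1;	/* sketch's choice: unknown bus */
	alloc->next = map_head;
	map_head = alloc;
	return alloc;
}

/* Record the physical id for (segment, bus); returns 0 or -1 on failure. */
int set_physid(int segment, unsigned char bus, int phys_id)
{
	struct seg_map *m;
	int ret = -1;

	pthread_mutex_lock(&map_lock);
	m = find_seg_map(segment);
	if (m) {
		m->bus_to_physid[bus] = phys_id;
		ret = 0;
	}
	pthread_mutex_unlock(&map_lock);
	return ret;
}

/* Segment-aware counterpart of the old flat array lookup. */
int bus_to_physid(int segment, unsigned char bus)
{
	struct seg_map *m;
	int phys_id = -1;

	pthread_mutex_lock(&map_lock);
	m = lookup_locked(segment);
	if (m)
		phys_id = m->bus_to_physid[bus];
	pthread_mutex_unlock(&map_lock);
	return phys_id;
}
```

With this layout, set_physid(0, 0xff, 1) and set_physid(1, 0xff, 3) land in different per-segment arrays, so bus ff of segment 0 and bus ff of segment 1 no longer share an entry.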
[PATCH v3] perf, x86: Fix multi-segment problem of perf_event_intel_uncore
In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0.

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case, the relation between bus number and physical id may be broken
because "uncore_pcibus_to_physid" doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing a segment-aware pci2phy_map instead.

v2 -> v3:
 - fix up according to Peter's comment
 - introduce __find_pci2phy_map() to avert repetition

Signed-off-by: Taku Izumi
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c | 45 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h | 12 +-
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c | 13 ++-
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 31 +++
 4 files changed, 87 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 560e525..1ddac35 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];

 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,43 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
 	EVENT_CONSTRAINT(0, 0, 0);

+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+	struct pci2phy_map *map;
+	int phys_id = -1;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(bus)) {
+			phys_id = map->pbus_to_physid[bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
+
+	return phys_id;
+}
+
+struct pci2phy_map *__find_pci2phy_map(int segment)
+{
+	struct pci2phy_map *map;
+
+	lockdep_assert_held(&pci2phy_map_lock);
+
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == segment)
+			return map;
+	}
+
+	map = kmalloc(sizeof(struct pci2phy_map), GFP_ATOMIC);
+	if (map) {
+		map->segment = segment;
+		list_add_tail(&map->list, &pci2phy_map_head);
+	}
+
+	return map;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +847,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
 	int phys_id;
 	bool first_box = false;

-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	if (phys_id < 0)
 		return -ENODEV;

@@ -856,9 +894,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
 	struct intel_uncore_box *box = pci_get_drvdata(pdev);
 	struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id;
 	bool last_box = false;

+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	box = pci_get_drvdata(pdev);
 	if (!box) {
 		for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 72c54c2..2f0a4a9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -117,6 +117,15 @@ struct uncore_event_desc {
 	const char *config;
 };

+struct pci2phy_map {
+	struct list_head list;
+	int segment;
+	int pbus_to_physid[256];
+};
+
+int uncore_pcibus_to_physid(struct pci_bus *bus);
+struct pci2phy_map *__find_pci2phy_map(int segment);
+
 ssize_t uncore_event_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf);

@@ -317,7 +326,8 @@ u64 uncore_shared_reg_config(struct intel_uncore_box *box, int idx);
 exter
[PATCH v2][RESEND] perf, x86: Fix multi-segment problem of perf_event_intel_uncore
In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0.

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case, the relation between bus number and physical id may be broken
because "uncore_pcibus_to_physid" doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing a segment-aware pci2phy_map instead.

v1 -> v2:
 - Extract method named uncore_pcibus_to_physid to avoid repetition of
   retrieving phys_id code

Signed-off-by: Taku Izumi
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c | 25 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h | 11 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c | 23 +-
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 53 --
 4 files changed, 94 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 21b5e38..0ed6f2b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];

 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,23 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
 	EVENT_CONSTRAINT(0, 0, 0);

+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+	int phys_id = -1;
+	struct pci2phy_map *map;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(bus)) {
+			phys_id = map->pbus_to_physid[bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
+
+	return phys_id;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +827,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
 	int phys_id;
 	bool first_box = false;

-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	if (phys_id < 0)
 		return -ENODEV;

@@ -856,9 +874,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
 	struct intel_uncore_box *box = pci_get_drvdata(pdev);
 	struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id;
 	bool last_box = false;

+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	box = pci_get_drvdata(pdev);
 	if (!box) {
 		for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 0f77f0a..6c96ee9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -117,6 +117,14 @@ struct uncore_event_desc {
 	const char *config;
 };

+struct pci2phy_map {
+	struct list_head list;
+	int segment;
+	int pbus_to_physid[256];
+};
+
+int uncore_pcibus_to_physid(struct pci_bus *bus);
+
 ssize_t uncore_event_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf);

@@ -317,7 +325,8 @@ u64 uncore_shared_reg_config(struct intel_uncore_box *box, int idx);
 extern struct intel_uncore_type **uncore_msr_uncores;
 extern struct intel_uncore_type **uncore_pci_uncores;
 extern struct pci_driver *uncore_pci_driver;
-extern int uncore_pcibus_to_physid[256];
+extern raw_spinlock_t pci2phy_map_lock;
+extern struct list_head pci2phy_map_head;
 extern struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 extern struct event_constraint uncore_constraint_empty;

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c b/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c
index b005a78..ccbc817 1
RE: [PATCH 2/2] x86, efi: Add "efi_fake_mem_mirror" boot option
Dear Matt,

Thank you for reviewing.
I have updated my patchset. I would be happy if you could review the new one.

Sincerely,
Taku Izumi

> -----Original Message-----
> From: Matt Fleming [mailto:m...@codeblueprint.co.uk]
> Sent: Wednesday, August 26, 2015 8:46 AM
> To: Izumi, Taku/泉 拓
> Cc: linux-kernel@vger.kernel.org; linux-...@vger.kernel.org; x...@kernel.org; matt.flem...@intel.com;
> t...@linutronix.de; mi...@redhat.com; h...@zytor.com; tony.l...@intel.com; qiuxi...@huawei.com; Kamezawa, Hiroyuki/亀澤 寛之
> Subject: Re: [PATCH 2/2] x86, efi: Add "efi_fake_mem_mirror" boot option
>
> On Fri, 21 Aug, at 02:16:00AM, Taku Izumi wrote:
> > This patch introduces new boot option named "efi_fake_mem_mirror".
> > By specifying this parameter, you can mark specific memory as
> > mirrored memory. This is useful for debugging of Address Range
> > Mirroring feature.
> >
> > For example, if you specify "efi_fake_mem_mirror=2G@4G,2G@0x10a000",
> > the original (firmware provided) EFI memmap will be updated so that
> > the specified memory regions have EFI_MEMORY_MORE_RELIABLE attribute:
> >
> >  original EFI memmap
> >    efi: mem00: [Boot Data | || | | | |WB|WT|WC|UC] range=[0x-0x1000) (0MB)
> >    efi: mem01: [Loader Data| || | | | |WB|WT|WC|UC] range=[0x1000-0x2000) (0MB)
> >    ...
> >    efi: mem35: [Boot Data | || | | | |WB|WT|WC|UC] range=[0x47ee6000-0x48014000) (1MB)
> >    efi: mem36: [Conventional Memory| || | | | |WB|WT|WC|UC] range=[0x0001-0x0020a000) (129536MB)
> >    efi: mem37: [Reserved |RUN|| | | | | | | |UC] range=[0x6000-0x9000) (768MB)
> >
> >  updated EFI memmap
> >    efi: mem00: [Boot Data | || | | | |WB|WT|WC|UC] range=[0x-0x1000) (0MB)
> >    efi: mem01: [Loader Data| || | | | |WB|WT|WC|UC] range=[0x1000-0x2000) (0MB)
> >    ...
> >    efi: mem35: [Boot Data | || | | | |WB|WT|WC|UC] range=[0x47ee6000-0x48014000) (1MB)
> >    efi: mem36: [Conventional Memory| |RELY| | | | |WB|WT|WC|UC] range=[0x0001-0x00018000) (2048MB)
> >    efi: mem37: [Conventional Memory| || | | | |WB|WT|WC|UC] range=[0x00018000-0x0010a000) (61952MB)
> >    efi: mem38: [Conventional Memory| |RELY| | | | |WB|WT|WC|UC] range=[0x0010a000-0x00112000) (2048MB)
> >    efi: mem39: [Conventional Memory| || | | | |WB|WT|WC|UC] range=[0x00112000-0x0020a000) (63488MB)
> >    efi: mem40: [Reserved |RUN|| | | | | | | |UC] range=[0x6000-0x9000) (768MB)
> >
> > And you will find that the following message is output:
> >
> >    efi: Memory: 4096M/131455M mirrored memory
> >
> > Signed-off-by: Taku Izumi
> > ---
> >  Documentation/kernel-parameters.txt | 8 ++
> >  arch/x86/include/asm/efi.h | 2 +
> >  arch/x86/kernel/setup.c | 4 +-
> >  arch/x86/platform/efi/efi.c | 2 +-
> >  arch/x86/platform/efi/quirks.c | 169
> >  5 files changed, 183 insertions(+), 2 deletions(-)
>
> [...]
>
> > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > index 1c7380d..5c785e1 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -18,6 +18,10 @@
> >
>
> The quirks file isn't intended to be used for this kind of feature.
> It's very much a repository for workarounds for quirky firmware, i.e.
> known bugs.
>
> Instead, how about putting all this into a new fake_mem.c file? Going
> further than that, there's nothing that I can see that looks
> particularly x86-specific, so how about sticking all this in
> drivers/firmware/efi/fake_mem.c so that the arm64 folks can make use
> of it if/when they want to start playing around with
> EFI_MEMORY_MORE_RELIABLE?
>
> >  static efi_char16_t efi_dummy_name[6] = { 'D', 'U', 'M', 'M', 'Y', 0 };
> >
> > +#define EFI_MAX_FAKE_MIRROR 8
> > +static struct range fake_mirrors[EFI_MAX_FAKE_MIRROR];
> > +static int num_fake_mirror;
> > +
> >  static bool efi_no_storage_paranoia;
> >
[PATCH v2 1/3] efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format()
UEFI spec 2.5 introduces a new Memory Attribute Definition named
EFI_MEMORY_MORE_RELIABLE. This patch adds support for this new
attribute to efi_md_typeattr_format().

Signed-off-by: Taku Izumi
---
 drivers/firmware/efi/efi.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index d6144e3..8124078 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -589,12 +589,14 @@ char * __init efi_md_typeattr_format(char *buf, size_t size,
 	attr = md->attribute;
 	if (attr & ~(EFI_MEMORY_UC | EFI_MEMORY_WC | EFI_MEMORY_WT |
 		     EFI_MEMORY_WB | EFI_MEMORY_UCE | EFI_MEMORY_WP |
-		     EFI_MEMORY_RP | EFI_MEMORY_XP | EFI_MEMORY_RUNTIME))
+		     EFI_MEMORY_RP | EFI_MEMORY_XP | EFI_MEMORY_RUNTIME |
+		     EFI_MEMORY_MORE_RELIABLE))
 		snprintf(pos, size, "|attr=0x%016llx]",
 			 (unsigned long long)attr);
 	else
-		snprintf(pos, size, "|%3s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
+		snprintf(pos, size, "|%3s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
 			 attr & EFI_MEMORY_RUNTIME ? "RUN" : "",
+			 attr & EFI_MEMORY_MORE_RELIABLE ? "MR" : "",
 			 attr & EFI_MEMORY_XP ? "XP" : "",
 			 attr & EFI_MEMORY_RP ? "RP" : "",
 			 attr & EFI_MEMORY_WP ? "WP" : "",
--
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
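To see what the extended format string produces, the attribute part of the output can be reproduced in a small userspace program. The following is only a sketch, not the kernel function: format_attr() is a hypothetical helper written for this example that formats just the attribute columns, and the bit values are the ones defined by the UEFI specification. Any attribute bit outside the known set falls back to the raw hex form, as in the patch.

```c
#include <stdint.h>
#include <stdio.h>

/* Attribute bit values as defined by the UEFI specification. */
#define EFI_MEMORY_UC            0x0000000000000001ULL
#define EFI_MEMORY_WC            0x0000000000000002ULL
#define EFI_MEMORY_WT            0x0000000000000004ULL
#define EFI_MEMORY_WB            0x0000000000000008ULL
#define EFI_MEMORY_UCE           0x0000000000000010ULL
#define EFI_MEMORY_WP            0x0000000000001000ULL
#define EFI_MEMORY_RP            0x0000000000002000ULL
#define EFI_MEMORY_XP            0x0000000000004000ULL
#define EFI_MEMORY_MORE_RELIABLE 0x0000000000010000ULL
#define EFI_MEMORY_RUNTIME       0x8000000000000000ULL

#define KNOWN_ATTRS (EFI_MEMORY_UC | EFI_MEMORY_WC | EFI_MEMORY_WT | \
		     EFI_MEMORY_WB | EFI_MEMORY_UCE | EFI_MEMORY_WP | \
		     EFI_MEMORY_RP | EFI_MEMORY_XP | EFI_MEMORY_RUNTIME | \
		     EFI_MEMORY_MORE_RELIABLE)

/*
 * Hypothetical helper for illustration: format only the attribute
 * columns of a memory descriptor into buf.
 */
int format_attr(char *buf, size_t size, uint64_t attr)
{
	/* Unknown bits: print the raw value instead of the columns. */
	if (attr & ~KNOWN_ATTRS)
		return snprintf(buf, size, "|attr=0x%016llx]",
				(unsigned long long)attr);

	return snprintf(buf, size,
			"|%3s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
			attr & EFI_MEMORY_RUNTIME ? "RUN" : "",
			attr & EFI_MEMORY_MORE_RELIABLE ? "MR" : "",
			attr & EFI_MEMORY_XP ? "XP" : "",
			attr & EFI_MEMORY_RP ? "RP" : "",
			attr & EFI_MEMORY_WP ? "WP" : "",
			attr & EFI_MEMORY_UCE ? "UCE" : "",
			attr & EFI_MEMORY_WB ? "WB" : "",
			attr & EFI_MEMORY_WT ? "WT" : "",
			attr & EFI_MEMORY_WC ? "WC" : "",
			attr & EFI_MEMORY_UC ? "UC" : "");
}
```

For a conventional memory region with WB, WT, WC, UC and the new attribute set, the helper yields a string with "MR" in the second column and blanks in the columns whose bits are clear.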
[PATCH v2 0/3] Introduce "efi_fake_mem_mirror" boot option
UEFI spec 2.5 introduces a new Memory Attribute Definition named
EFI_MEMORY_MORE_RELIABLE, which indicates which memory ranges are
mirrored. The Linux kernel can now recognize which memory ranges are
mirrored by handling the EFI_MEMORY_MORE_RELIABLE attribute. However,
testing this feature requires machines with UEFI spec 2.5 compliant
firmware.

This patchset introduces a new boot option named "efi_fake_mem_mirror".
By specifying this parameter, you can mark specific memory as
mirrored memory. This is useful for debugging the Memory Address
Range Mirroring feature.

v1 -> v2:
 - change abbreviation of EFI_MEMORY_MORE_RELIABLE from "RELY" to "MR"
 - add patch (2/3) for changing abbreviation of EFI_MEMORY_RUNTIME
 - migrate some code from arch/x86/platform/efi/quirks to
   drivers/firmware/efi/fake_mem.c and create config EFI_FAKE_MEMMAP

Taku Izumi (3):
  efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format()
  efi: Change abbreviation of EFI_MEMORY_RUNTIME from "RUN" to "RT"
  x86, efi: Add "efi_fake_mem_mirror" boot option

 Documentation/kernel-parameters.txt | 8 ++
 arch/x86/include/asm/efi.h | 1 +
 arch/x86/kernel/setup.c | 4 +-
 arch/x86/platform/efi/efi.c | 2 +-
 drivers/firmware/efi/Kconfig | 12 +++
 drivers/firmware/efi/Makefile | 1 +
 drivers/firmware/efi/efi.c | 8 +-
 drivers/firmware/efi/fake_mem.c | 204
 include/linux/efi.h | 6 ++
 9 files changed, 241 insertions(+), 5 deletions(-)
 create mode 100644 drivers/firmware/efi/fake_mem.c

--
1.8.3.1
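The option takes a comma-separated list of nn[KMG]@ss[KMG] pairs (size, then start address). How such a string could be turned into address ranges can be sketched in userspace C as below. This is an illustration only and not the code in fake_mem.c: the kernel has its own memparse() helper for suffixed numbers, whereas parse_size(), parse_fake_mirror(), struct mirror_range and MAX_FAKE_MIRROR are names invented for this sketch, and the end address is kept exclusive here for simplicity.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_FAKE_MIRROR 8	/* hypothetical limit for this sketch */

struct mirror_range {
	uint64_t start;
	uint64_t end;		/* exclusive: start + size */
};

/* Parse a number with an optional K/M/G suffix (strtoull base 0). */
static uint64_t parse_size(const char *s, char **endp)
{
	uint64_t v = strtoull(s, endp, 0);

	if (*endp == s)		/* no digits at all */
		return 0;

	switch (**endp) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*endp)++;
		break;
	default:
		break;
	}
	return v;
}

/*
 * Parse "nn[KMG]@ss[KMG][,nn[KMG]@ss[KMG],..]" into out[].
 * Returns the number of ranges parsed, or -1 on malformed input.
 * Entries beyond max are ignored.
 */
int parse_fake_mirror(const char *p, struct mirror_range *out, int max)
{
	int n = 0;

	while (*p && n < max) {
		char *end;
		uint64_t size, start;

		size = parse_size(p, &end);
		if (end == p || *end != '@')
			return -1;
		p = end + 1;

		start = parse_size(p, &end);
		if (end == p)
			return -1;

		out[n].start = start;
		out[n].end = start + size;
		n++;

		p = end;
		if (*p == ',')
			p++;
		else if (*p)
			return -1;
	}
	return n;
}
```

With an input such as "2G@4G", the first range starts at 0x100000000 and ends (exclusively) at 0x180000000, i.e. the region from ss to ss+nn described in the documentation text.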
[PATCH v2 2/3] efi: Change abbreviation of EFI_MEMORY_RUNTIME from "RUN" to "RT"
Currently efi_md_typeattr_format() outputs "RUN" if the passed EFI memory
descriptor has the EFI_MEMORY_RUNTIME attribute. But "RT" is preferred
because it is shorter and clearer.

This patch changes the abbreviation of EFI_MEMORY_RUNTIME from "RUN" to "RT".

Suggested-by: Ard Biesheuvel
Signed-off-by: Taku Izumi
---
 drivers/firmware/efi/efi.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 8124078..25b6477 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -594,8 +594,8 @@ char * __init efi_md_typeattr_format(char *buf, size_t size,
 		snprintf(pos, size, "|attr=0x%016llx]",
 			 (unsigned long long)attr);
 	else
-		snprintf(pos, size, "|%3s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
-			 attr & EFI_MEMORY_RUNTIME ? "RUN" : "",
+		snprintf(pos, size, "|%2s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
+			 attr & EFI_MEMORY_RUNTIME ? "RT" : "",
 			 attr & EFI_MEMORY_MORE_RELIABLE ? "MR" : "",
 			 attr & EFI_MEMORY_XP ? "XP" : "",
 			 attr & EFI_MEMORY_RP ? "RP" : "",
--
1.8.3.1
[PATCH v2 3/3] x86, efi: Add "efi_fake_mem_mirror" boot option
This patch introduces a new boot option named "efi_fake_mem_mirror".
By specifying this parameter, you can mark specific memory as
mirrored memory. This is useful for debugging the Address Range
Mirroring feature.

For example, if you specify "efi_fake_mem_mirror=2G@4G,2G@0x10a000",
the original (firmware provided) EFI memmap will be updated so that
the specified memory regions have the EFI_MEMORY_MORE_RELIABLE attribute:

 original EFI memmap
   efi: mem00: [Boot Data | | | | | | |WB|WT|WC|UC] range=[0x-0x1000) (0MB)
   efi: mem01: [Loader Data| | | | | | |WB|WT|WC|UC] range=[0x1000-0x2000) (0MB)
   ...
   efi: mem35: [Boot Data | | | | | | |WB|WT|WC|UC] range=[0x47ee6000-0x48014000) (1MB)
   efi: mem36: [Conventional Memory| | | | | | |WB|WT|WC|UC] range=[0x0001-0x0020a000) (129536MB)
   efi: mem37: [Reserved |RT| | | | | | | | |UC] range=[0x6000-0x9000) (768MB)

 updated EFI memmap
   efi: mem00: [Boot Data | | | | | | |WB|WT|WC|UC] range=[0x-0x1000) (0MB)
   efi: mem01: [Loader Data| | | | | | |WB|WT|WC|UC] range=[0x1000-0x2000) (0MB)
   ...
   efi: mem35: [Boot Data | | | | | | |WB|WT|WC|UC] range=[0x47ee6000-0x48014000) (1MB)
   efi: mem36: [Conventional Memory| |MR| | | | |WB|WT|WC|UC] range=[0x0001-0x00018000) (2048MB)
   efi: mem37: [Conventional Memory| | | | | | |WB|WT|WC|UC] range=[0x00018000-0x0010a000) (61952MB)
   efi: mem38: [Conventional Memory| |MR| | | | |WB|WT|WC|UC] range=[0x0010a000-0x00112000) (2048MB)
   efi: mem39: [Conventional Memory| | | | | | |WB|WT|WC|UC] range=[0x00112000-0x0020a000) (63488MB)
   efi: mem40: [Reserved |RT| | | | | | | | |UC] range=[0x6000-0x9000) (768MB)

And you will find that the following message is output:

   efi: Memory: 4096M/131455M mirrored memory

Signed-off-by: Taku Izumi
---
 Documentation/kernel-parameters.txt | 8 ++
 arch/x86/include/asm/efi.h | 1 +
 arch/x86/kernel/setup.c | 4 +-
 arch/x86/platform/efi/efi.c | 2 +-
 drivers/firmware/efi/Kconfig | 12 +++
 drivers/firmware/efi/Makefile | 1 +
 drivers/firmware/efi/fake_mem.c | 204
 include/linux/efi.h | 6 ++
 8 files changed, 236 insertions(+), 2 deletions(-)
 create mode 100644 drivers/firmware/efi/fake_mem.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f045..0efded6 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1092,6 +1092,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			you are really sure that your UEFI does sane gc and
 			fulfills the spec otherwise your board may brick.

+	efi_fake_mem_mirror=nn[KMG]@ss[KMG][,nn[KMG]@ss[KMG],..] [EFI; X86]
+			Mark specific memory as mirrored memory and update
+			EFI memory map.
+			Region of memory to be marked is from ss to ss+nn.
+			Using this parameter you can do debugging of Address
+			Range Mirroring feature even if your box doesn't support
+			it.
+
 	eisa_irq_edge=	[PARISC,HW]
 			See header of drivers/parisc/eisa.c.

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 155162e..479fd51 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -93,6 +93,7 @@ extern void __init efi_set_executable(efi_memory_desc_t *md, bool executable);
 extern int __init efi_memblock_x86_reserve_range(void);
 extern pgd_t * __init efi_call_phys_prolog(void);
 extern void __init efi_call_phys_epilog(pgd_t *save_pgd);
+extern void __init print_efi_memmap(void);
 extern void __init efi_unmap_memmap(void);
 extern void __init efi_memory_uc(u64 addr, unsigned long size);
 extern void __init efi_map_region(efi_memory_desc_t *md);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 80f874b..e3ed628 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1104,8 +1104,10 @@ void __init setup_arch(char **cmdline_p)
 	memblock_set_current_limit(ISA_END_ADDRESS);
 	memblock_x86_fill();

-	if (efi_enabled(EFI_BOOT))
+	if (efi_enabled(EFI_BOOT)) {
+		efi_fake_memmap();
 		efi_find_mirror();
+	}

 	/*
 	 * The EFI specification says that boot service code won't be called
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index e4308fe..eee8068 100644
--- a/a
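The memmap example in the commit message shows one "Conventional Memory" descriptor being cut into mirrored and non-mirrored pieces. The body of fake_mem.c is not part of this excerpt, so the following is not the patch's implementation; it is only a minimal userspace sketch of the underlying range arithmetic, splitting one descriptor range against a single mirror range. struct piece and split_range() are hypothetical names, and end addresses are exclusive in this sketch.

```c
#include <stdint.h>

/* One resulting piece of a descriptor; end is exclusive. */
struct piece {
	uint64_t start;
	uint64_t end;
	int mirrored;	/* 1 if the piece would get the MR attribute */
};

/*
 * Split the descriptor range [start, end) against one mirror range
 * [m_start, m_end). Writes 1 to 3 pieces into out[] and returns the
 * number of pieces.
 */
int split_range(uint64_t start, uint64_t end,
		uint64_t m_start, uint64_t m_end, struct piece out[3])
{
	uint64_t lo, hi;
	int n = 0;

	/* No overlap: the descriptor is left as it is. */
	if (m_end <= start || end <= m_start) {
		out[n++] = (struct piece){ start, end, 0 };
		return n;
	}

	/* Part of the descriptor below the mirror range. */
	if (start < m_start)
		out[n++] = (struct piece){ start, m_start, 0 };

	/* Overlapping part, clamped to the descriptor. */
	lo = start > m_start ? start : m_start;
	hi = end < m_end ? end : m_end;
	out[n++] = (struct piece){ lo, hi, 1 };

	/* Part of the descriptor above the mirror range. */
	if (m_end < end)
		out[n++] = (struct piece){ m_end, end, 0 };

	return n;
}
```

A mirror range that begins exactly at the start of a descriptor yields two pieces (mirrored, then plain), while one that lies strictly inside yields three, which is consistent with the descriptor count growing from 38 to 41 entries in the example above.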
RE: [PATCH 2/2] x86, efi: Add "efi_fake_mem_mirror" boot option
Dear Matt,

Thank you for reviewing. I have updated my patchset and would be happy if you could review the new version.

Sincerely,
Taku Izumi

-----Original Message-----
From: Matt Fleming [mailto:m...@codeblueprint.co.uk]
Sent: Wednesday, August 26, 2015 8:46 AM
To: Izumi, Taku/泉 拓
Cc: linux-kernel@vger.kernel.org; linux-...@vger.kernel.org; x...@kernel.org; matt.flem...@intel.com; t...@linutronix.de; mi...@redhat.com; h...@zytor.com; tony.l...@intel.com; qiuxi...@huawei.com; Kamezawa, Hiroyuki/亀 澤 寛之
Subject: Re: [PATCH 2/2] x86, efi: Add efi_fake_mem_mirror boot option

On Fri, 21 Aug, at 02:16:00AM, Taku Izumi wrote:
> This patch introduces new boot option named efi_fake_mem_mirror.
> By specifying this parameter, you can mark specific memory as
> mirrored memory. This is useful for debugging of Address Range
> Mirroring feature.
>
> For example, if you specify efi_fake_mem_mirror=2G@4G,2G@0x10a000,
> the original (firmware provided) EFI memmap will be updated so that
> the specified memory regions have EFI_MEMORY_MORE_RELIABLE attribute:
>
> original EFI memmap
> efi: mem00: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x-0x1000) (0MB)
> efi: mem01: [Loader Data        |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x1000-0x2000) (0MB)
> ...
> efi: mem35: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x47ee6000-0x48014000) (1MB)
> efi: mem36: [Conventional Memory|   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0001-0x0020a000) (129536MB)
> efi: mem37: [Reserved           |RUN|    |  |  |  |   |  |  |  |UC] range=[0x6000-0x9000) (768MB)
>
> updated EFI memmap
> efi: mem00: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x-0x1000) (0MB)
> efi: mem01: [Loader Data        |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x1000-0x2000) (0MB)
> ...
> efi: mem35: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x47ee6000-0x48014000) (1MB)
> efi: mem36: [Conventional Memory|   |RELY|  |  |  |   |WB|WT|WC|UC] range=[0x0001-0x00018000) (2048MB)
> efi: mem37: [Conventional Memory|   |    |  |  |  |   |WB|WT|WC|UC] range=[0x00018000-0x0010a000) (61952MB)
> efi: mem38: [Conventional Memory|   |RELY|  |  |  |   |WB|WT|WC|UC] range=[0x0010a000-0x00112000) (2048MB)
> efi: mem39: [Conventional Memory|   |    |  |  |  |   |WB|WT|WC|UC] range=[0x00112000-0x0020a000) (63488MB)
> efi: mem40: [Reserved           |RUN|    |  |  |  |   |  |  |  |UC] range=[0x6000-0x9000) (768MB)
>
> And you will find that the following message is output:
>
> efi: Memory: 4096M/131455M mirrored memory
>
> Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
> ---
>  Documentation/kernel-parameters.txt | 8 ++
>  arch/x86/include/asm/efi.h | 2 +
>  arch/x86/kernel/setup.c | 4 +-
>  arch/x86/platform/efi/efi.c | 2 +-
>  arch/x86/platform/efi/quirks.c | 169
>  5 files changed, 183 insertions(+), 2 deletions(-)

[...]

> diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> index 1c7380d..5c785e1 100644
> --- a/arch/x86/platform/efi/quirks.c
> +++ b/arch/x86/platform/efi/quirks.c
> @@ -18,6 +18,10 @@

The quirks file isn't intended to be used for this kind of feature. It's very much a repository for workarounds for quirky firmware, i.e. known bugs.

Instead, how about putting all this into a new fake_mem.c file? Going further than that, there's nothing that I can see that looks particularly x86-specific, so how about sticking all this in drivers/firmware/efi/fake_mem.c so that the arm64 folks can make use of it if/when they want to start playing around with EFI_MEMORY_MORE_RELIABLE?

>  static efi_char16_t efi_dummy_name[6] = { 'D', 'U', 'M', 'M', 'Y', 0 };
>
> +#define EFI_MAX_FAKE_MIRROR 8
> +static struct range fake_mirrors[EFI_MAX_FAKE_MIRROR];
> +static int num_fake_mirror;
> +
>  static bool efi_no_storage_paranoia;
>
>  /*
> @@ -288,3 +292,168 @@ bool efi_poweroff_required(void)
>  {
>  	return !!acpi_gbl_reduced_hardware;
>  }
> +
> +void __init efi_fake_memmap(void)
> +{
> +	efi_memory_desc_t *md;
> +	void *p, *q;
> +	int i;
> +	int nr_map = memmap.nr_map;
> +	u64 start, end, m_start, m_end;
> +	u64 new_memmap_phy;
> +	void *new_memmap;
> +
> +	if (!num_fake_mirror)
> +		return;
> +
> +	/* count up the number of EFI memory descriptor */
> +	for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
> +		md = p;
> +		start = md->phys_addr;
> +		end = start + (md->num_pages << EFI_PAGE_SHIFT) - 1;
> +
> +		for (i = 0; i < num_fake_mirror; i
[PATCH v2 3/3] x86, efi: Add efi_fake_mem_mirror boot option
This patch introduces a new boot option named efi_fake_mem_mirror.
By specifying this parameter, you can mark specific memory as
mirrored memory. This is useful for debugging the Address Range
Mirroring feature.

For example, if you specify efi_fake_mem_mirror=2G@4G,2G@0x10a000,
the original (firmware provided) EFI memmap will be updated so that
the specified memory regions have the EFI_MEMORY_MORE_RELIABLE attribute:

original EFI memmap
efi: mem00: [Boot Data          |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x-0x1000) (0MB)
efi: mem01: [Loader Data        |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x1000-0x2000) (0MB)
...
efi: mem35: [Boot Data          |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x47ee6000-0x48014000) (1MB)
efi: mem36: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0001-0x0020a000) (129536MB)
efi: mem37: [Reserved           |RT|  |  |  |  |   |  |  |  |UC] range=[0x6000-0x9000) (768MB)

updated EFI memmap
efi: mem00: [Boot Data          |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x-0x1000) (0MB)
efi: mem01: [Loader Data        |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x1000-0x2000) (0MB)
...
efi: mem35: [Boot Data          |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x47ee6000-0x48014000) (1MB)
efi: mem36: [Conventional Memory|  |MR|  |  |  |   |WB|WT|WC|UC] range=[0x0001-0x00018000) (2048MB)
efi: mem37: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x00018000-0x0010a000) (61952MB)
efi: mem38: [Conventional Memory|  |MR|  |  |  |   |WB|WT|WC|UC] range=[0x0010a000-0x00112000) (2048MB)
efi: mem39: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x00112000-0x0020a000) (63488MB)
efi: mem40: [Reserved           |RT|  |  |  |  |   |  |  |  |UC] range=[0x6000-0x9000) (768MB)

And you will find that the following message is output:

efi: Memory: 4096M/131455M mirrored memory

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt | 8 ++
 arch/x86/include/asm/efi.h | 1 +
 arch/x86/kernel/setup.c | 4 +-
 arch/x86/platform/efi/efi.c | 2 +-
 drivers/firmware/efi/Kconfig | 12 +++
 drivers/firmware/efi/Makefile | 1 +
 drivers/firmware/efi/fake_mem.c | 204
 include/linux/efi.h | 6 ++
 8 files changed, 236 insertions(+), 2 deletions(-)
 create mode 100644 drivers/firmware/efi/fake_mem.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f045..0efded6 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1092,6 +1092,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			you are really sure that your UEFI does sane gc and
 			fulfills the spec otherwise your board may brick.
 
+	efi_fake_mem_mirror=nn[KMG]@ss[KMG][,nn[KMG]@ss[KMG],..] [EFI; X86]
+			Mark specific memory as mirrored memory and update
+			EFI memory map.
+			Region of memory to be marked is from ss to ss+nn.
+			Using this parameter you can do debugging of Address
+			Range Mirroring feature even if your box doesn't support
+			it.
+
 	eisa_irq_edge=	[PARISC,HW]
 			See header of drivers/parisc/eisa.c.
 
diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 155162e..479fd51 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -93,6 +93,7 @@ extern void __init efi_set_executable(efi_memory_desc_t *md, bool executable);
 extern int __init efi_memblock_x86_reserve_range(void);
 extern pgd_t * __init efi_call_phys_prolog(void);
 extern void __init efi_call_phys_epilog(pgd_t *save_pgd);
+extern void __init print_efi_memmap(void);
 extern void __init efi_unmap_memmap(void);
 extern void __init efi_memory_uc(u64 addr, unsigned long size);
 extern void __init efi_map_region(efi_memory_desc_t *md);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 80f874b..e3ed628 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1104,8 +1104,10 @@ void __init setup_arch(char **cmdline_p)
 	memblock_set_current_limit(ISA_END_ADDRESS);
 	memblock_x86_fill();
 
-	if (efi_enabled(EFI_BOOT))
+	if (efi_enabled(EFI_BOOT)) {
+		efi_fake_memmap();
 		efi_find_mirror();
+	}
 
 	/*
 	 * The EFI specification says that boot service code won't be called
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
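The way mem36 is split into four entries in the "updated EFI memmap" listing above can be sketched in userspace C. This is only an illustration under stated assumptions, not the code in fake_mem.c: `struct span` and `split_by_mirror()` are hypothetical names, ranges are half-open, and the requested mirror ranges are assumed to be sorted and non-overlapping.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration; not the kernel's efi_memory_desc_t. */
struct span {
	uint64_t start;		/* inclusive */
	uint64_t end;		/* exclusive */
	int mirrored;		/* 1 if the span would get EFI_MEMORY_MORE_RELIABLE */
};

/*
 * Split the region [start, end) around the requested mirror ranges.
 * Assumes mirror[] is sorted by start address and has no overlaps.
 * Writes at most 2 * nr + 1 spans to out[] and returns how many.
 */
static size_t split_by_mirror(uint64_t start, uint64_t end,
			      const struct span *mirror, size_t nr,
			      struct span *out)
{
	size_t n = 0;
	uint64_t cur = start;

	for (size_t i = 0; i < nr && cur < end; i++) {
		/* Clamp the mirror range to what is left of the region. */
		uint64_t m_start = mirror[i].start > cur ? mirror[i].start : cur;
		uint64_t m_end = mirror[i].end < end ? mirror[i].end : end;

		if (m_start >= m_end)
			continue;	/* no overlap with the remaining region */
		if (cur < m_start)
			out[n++] = (struct span){ cur, m_start, 0 };
		out[n++] = (struct span){ m_start, m_end, 1 };
		cur = m_end;
	}
	if (cur < end)
		out[n++] = (struct span){ cur, end, 0 };
	return n;
}
```

Working in MB and assuming mem36 starts at 4096 MB (inferred from the "2G@4G" request and the printed sizes, not from the garbled addresses), splitting [4096, 133632) around [4096, 6144) and [68096, 70144) gives spans of 2048, 61952, 2048 and 63488 MB, which matches the sizes printed for mem36 through mem39.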
[PATCH v2 2/3] efi: Change abbreviation of EFI_MEMORY_RUNTIME from RUN to RT
Currently efi_md_typeattr_format() outputs "RUN" if the passed EFI memory
descriptor has the EFI_MEMORY_RUNTIME attribute. "RT" is preferred because
it is shorter and clearer.

This patch changes the abbreviation of EFI_MEMORY_RUNTIME from "RUN" to "RT".

Suggested-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 drivers/firmware/efi/efi.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 8124078..25b6477 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -594,8 +594,8 @@ char * __init efi_md_typeattr_format(char *buf, size_t size,
 		snprintf(pos, size, "|attr=0x%016llx]",
 			 (unsigned long long)attr);
 	else
-		snprintf(pos, size, "|%3s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
-			 attr & EFI_MEMORY_RUNTIME ? "RUN" : "",
+		snprintf(pos, size, "|%2s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
+			 attr & EFI_MEMORY_RUNTIME ? "RT" : "",
 			 attr & EFI_MEMORY_MORE_RELIABLE ? "MR" : "",
 			 attr & EFI_MEMORY_XP ? "XP" : "",
 			 attr & EFI_MEMORY_RP ? "RP" : "",
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 1/3] efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format()
UEFI spec 2.5 introduces a new Memory Attribute Definition named
EFI_MEMORY_MORE_RELIABLE. This patch adds support for this new
attribute to efi_md_typeattr_format().

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 drivers/firmware/efi/efi.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index d6144e3..8124078 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -589,12 +589,14 @@ char * __init efi_md_typeattr_format(char *buf, size_t size,
 	attr = md->attribute;
 	if (attr & ~(EFI_MEMORY_UC | EFI_MEMORY_WC | EFI_MEMORY_WT |
 		     EFI_MEMORY_WB | EFI_MEMORY_UCE | EFI_MEMORY_WP |
-		     EFI_MEMORY_RP | EFI_MEMORY_XP | EFI_MEMORY_RUNTIME))
+		     EFI_MEMORY_RP | EFI_MEMORY_XP | EFI_MEMORY_RUNTIME |
+		     EFI_MEMORY_MORE_RELIABLE))
 		snprintf(pos, size, "|attr=0x%016llx]",
 			 (unsigned long long)attr);
 	else
-		snprintf(pos, size, "|%3s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
+		snprintf(pos, size, "|%3s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
 			 attr & EFI_MEMORY_RUNTIME ? "RUN" : "",
+			 attr & EFI_MEMORY_MORE_RELIABLE ? "MR" : "",
 			 attr & EFI_MEMORY_XP ? "XP" : "",
 			 attr & EFI_MEMORY_RP ? "RP" : "",
 			 attr & EFI_MEMORY_WP ? "WP" : "",
-- 
1.8.3.1
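To see what the attribute column looks like after this patch, the formatting logic can be tried in userspace. The following is a rough sketch, not the kernel function: `format_attr()` is a hypothetical name, the `ATTR_*` constants are local stand-ins whose values are the attribute bits as I understand them from the UEFI specification, and the format string is the one introduced by this patch (with "RUN" and "MR").

```c
#include <stdint.h>
#include <stdio.h>

/* Local stand-ins for the EFI_MEMORY_* attribute bits (assumed values). */
#define ATTR_UC			0x0000000000000001ULL
#define ATTR_WC			0x0000000000000002ULL
#define ATTR_WT			0x0000000000000004ULL
#define ATTR_WB			0x0000000000000008ULL
#define ATTR_UCE		0x0000000000000010ULL
#define ATTR_WP			0x0000000000001000ULL
#define ATTR_RP			0x0000000000002000ULL
#define ATTR_XP			0x0000000000004000ULL
#define ATTR_MORE_RELIABLE	0x0000000000010000ULL	/* bit 16 */
#define ATTR_RUNTIME		0x8000000000000000ULL

/* Hypothetical userspace analogue of the attribute part of efi_md_typeattr_format(). */
static int format_attr(char *buf, size_t size, uint64_t attr)
{
	const uint64_t known = ATTR_UC | ATTR_WC | ATTR_WT | ATTR_WB |
			       ATTR_UCE | ATTR_WP | ATTR_RP | ATTR_XP |
			       ATTR_RUNTIME | ATTR_MORE_RELIABLE;

	/* Unknown bits set: fall back to printing the raw value. */
	if (attr & ~known)
		return snprintf(buf, size, "|attr=0x%016llx]",
				(unsigned long long)attr);

	return snprintf(buf, size,
			"|%3s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
			attr & ATTR_RUNTIME ? "RUN" : "",
			attr & ATTR_MORE_RELIABLE ? "MR" : "",
			attr & ATTR_XP ? "XP" : "",
			attr & ATTR_RP ? "RP" : "",
			attr & ATTR_WP ? "WP" : "",
			attr & ATTR_UCE ? "UCE" : "",
			attr & ATTR_WB ? "WB" : "",
			attr & ATTR_WT ? "WT" : "",
			attr & ATTR_WC ? "WC" : "",
			attr & ATTR_UC ? "UC" : "");
}
```

Without the EFI_MEMORY_MORE_RELIABLE bit in the mask of known attributes, a mirrored region would take the first branch and be printed as a raw `attr=0x...` value, which is why the patch extends the mask as well as the format string.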
[PATCH v2 0/3] Introduce efi_fake_mem_mirror boot option
UEFI spec 2.5 introduces a new Memory Attribute Definition named
EFI_MEMORY_MORE_RELIABLE, which indicates which memory ranges are
mirrored. The Linux kernel can now recognize which memory ranges are
mirrored by handling the EFI_MEMORY_MORE_RELIABLE attribute. However,
testing this feature necessitates boxes with UEFI spec 2.5 compliant
firmware.

This patchset introduces a new boot option named efi_fake_mem_mirror.
By specifying this parameter, you can mark specific memory as
mirrored memory. This is useful for debugging the Memory Address Range
Mirroring feature.

v1 -> v2:
 - change abbreviation of EFI_MEMORY_MORE_RELIABLE from RELY to MR
 - add patch (2/3) for changing abbreviation of EFI_MEMORY_RUNTIME
 - migrate some code from arch/x86/platform/efi/quirks to
   drivers/firmware/efi/fake_mem.c and create config EFI_FAKE_MEMMAP

Taku Izumi (3):
  efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format()
  efi: Change abbreviation of EFI_MEMORY_RUNTIME from RUN to RT
  x86, efi: Add efi_fake_mem_mirror boot option

 Documentation/kernel-parameters.txt | 8 ++
 arch/x86/include/asm/efi.h | 1 +
 arch/x86/kernel/setup.c | 4 +-
 arch/x86/platform/efi/efi.c | 2 +-
 drivers/firmware/efi/Kconfig | 12 +++
 drivers/firmware/efi/Makefile | 1 +
 drivers/firmware/efi/efi.c | 8 +-
 drivers/firmware/efi/fake_mem.c | 204
 include/linux/efi.h | 6 ++
 9 files changed, 241 insertions(+), 5 deletions(-)
 create mode 100644 drivers/firmware/efi/fake_mem.c

-- 
1.8.3.1
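For readers unfamiliar with the nn[KMG]@ss[KMG] syntax used by this option, here is a small userspace sketch of how such a list could be parsed. It is not the parser from the patch: `parse_size()` is a simplified stand-in for the kernel's memparse(), and `struct fake_range` and `parse_fake_mirror()` are hypothetical names. Ranges are stored half-open here for simplicity.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result type: [start, end), end exclusive. */
struct fake_range {
	uint64_t start;
	uint64_t end;
};

/* Simplified stand-in for memparse(): a number with an optional K/M/G suffix. */
static uint64_t parse_size(const char *s, char **endp)
{
	uint64_t v = strtoull(s, endp, 0);

	if (*endp == s)
		return 0;	/* no digits consumed */

	switch (**endp) {
	case 'G': case 'g':
		v <<= 10;	/* fall through */
	case 'M': case 'm':
		v <<= 10;	/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*endp)++;
		break;
	default:
		break;
	}
	return v;
}

/* Parse "nn@ss[,nn@ss...]" into out[]; returns the number of ranges parsed. */
static int parse_fake_mirror(const char *s, struct fake_range *out, int max)
{
	int n = 0;

	while (n < max) {
		char *p;
		uint64_t size = parse_size(s, &p);

		if (p == s || *p != '@')
			break;		/* malformed or empty entry */
		s = p + 1;

		uint64_t start = parse_size(s, &p);

		if (p == s)
			break;
		out[n++] = (struct fake_range){ start, start + size };

		if (*p != ',')
			break;		/* end of list */
		s = p + 1;
	}
	return n;
}
```

With an input such as "2G@4G,2G@0x10a0000000" (the second address is an assumed full form, since the address printed in the patch description appears truncated), this yields two ranges of 2 GiB each, the first starting at 4 GiB.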
[PATCH v2][RESEND] perf, x86: Fix multi-segment problem of perf_event_intel_uncore
In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0:

 0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
 ...
 0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
 ...
 0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
 ...
 0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
 ...

In that case the relation between bus number and physical id may be broken,
because uncore_pcibus_to_physid doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
uncore_pcibus_to_physid array.

This patch fixes this problem by introducing a segment-aware pci2phy_map instead.

v1 -> v2:
 - Extract method named uncore_pcibus_to_physid to avoid repetition of
   retrieving phys_id code

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c | 25 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h | 11 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c | 23 +-
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 53 --
 4 files changed, 94 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 21b5e38..0ed6f2b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 
 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,23 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
 	EVENT_CONSTRAINT(0, 0, 0);
 
+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+	int phys_id = -1;
+	struct pci2phy_map *map;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(bus)) {
+			phys_id = map->pbus_to_physid[bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
+
+	return phys_id;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +827,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
 	int phys_id;
 	bool first_box = false;
 
-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	if (phys_id < 0)
 		return -ENODEV;
 
@@ -856,9 +874,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
 	struct intel_uncore_box *box = pci_get_drvdata(pdev);
 	struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id;
 	bool last_box = false;
 
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	box = pci_get_drvdata(pdev);
 	if (!box) {
 		for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 0f77f0a..6c96ee9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -117,6 +117,14 @@ struct uncore_event_desc {
 	const char *config;
 };
 
+struct pci2phy_map {
+	struct list_head list;
+	int segment;
+	int pbus_to_physid[256];
+};
+
+int uncore_pcibus_to_physid(struct pci_bus *bus);
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf);
 
@@ -317,7 +325,8 @@ u64 uncore_shared_reg_config(struct intel_uncore_box *box, int idx);
 extern struct intel_uncore_type **uncore_msr_uncores;
 extern struct intel_uncore_type **uncore_pci_uncores;
 extern struct pci_driver *uncore_pci_driver;
-extern int uncore_pcibus_to_physid[256];
+extern raw_spinlock_t pci2phy_map_lock;
+extern struct list_head pci2phy_map_head;
 extern struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 extern struct event_constraint uncore_constraint_empty;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c b/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c
index b005a78..ccbc817 100644
--- a/arch/x86/kernel/cpu
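The collision described in this patch (two buses numbered 0xff in different segments sharing one array slot) and the segment-aware fix can be illustrated outside the kernel. The sketch below is an assumption-laden simplification: it uses a plain singly linked list instead of the kernel's list_head, omits the raw spinlock, and `struct seg_map`, `seg_map_init()` and `lookup_physid()` are hypothetical names.

```c
#include <stddef.h>

#define BUS_MAX 256

/* Hypothetical userspace analogue of struct pci2phy_map (no locking). */
struct seg_map {
	struct seg_map *next;
	int segment;
	int pbus_to_physid[BUS_MAX];
};

/* Initialise a per-segment table with every bus unmapped (-1). */
static void seg_map_init(struct seg_map *m, int segment, struct seg_map *next)
{
	m->next = next;
	m->segment = segment;
	for (int i = 0; i < BUS_MAX; i++)
		m->pbus_to_physid[i] = -1;
}

/*
 * Look up the physical id by (segment, bus) rather than by bus alone.
 * Returns -1 when the segment is unknown or the bus has no mapping.
 */
static int lookup_physid(const struct seg_map *head, int segment,
			 unsigned char bus)
{
	for (const struct seg_map *m = head; m; m = m->next)
		if (m->segment == segment)
			return m->pbus_to_physid[bus];
	return -1;
}
```

With one table per segment, bus 0xff in segment 0 and bus 0xff in segment 1 can map to different physical ids, which a single 256-entry array indexed only by bus number cannot express.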
[PATCH 0/2][RFC] Introduce "efi_fake_mem_mirror" boot option
UEFI spec 2.5 introduces a new Memory Attribute Definition named
EFI_MEMORY_MORE_RELIABLE, which indicates which memory ranges are
mirrored. The Linux kernel can now recognize which memory ranges are
mirrored by handling the EFI_MEMORY_MORE_RELIABLE attribute. However,
testing this feature necessitates boxes with UEFI spec 2.5 compliant
firmware.

This patchset introduces a new boot option named "efi_fake_mem_mirror".
By specifying this parameter, you can mark specific memory as
mirrored memory. This is useful for debugging the Memory Address Range
Mirroring feature.

Taku Izumi (2):
  efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format()
  x86, efi: Add "efi_fake_mem_mirror" boot option

 Documentation/kernel-parameters.txt | 8 ++
 arch/x86/include/asm/efi.h | 2 +
 arch/x86/kernel/setup.c | 4 +-
 arch/x86/platform/efi/efi.c | 2 +-
 arch/x86/platform/efi/quirks.c | 169
 drivers/firmware/efi/efi.c | 6 +-
 6 files changed, 187 insertions(+), 4 deletions(-)

-- 
1.8.3.1
[PATCH 1/2] efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format()
UEFI spec 2.5 introduces new Memory Attribute Definition named
EFI_MEMORY_MORE_RELIABLE. This patch adds this new attribute
support to efi_md_typeattr_format().

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 drivers/firmware/efi/efi.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index d6144e3..aadc1c4 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -589,12 +589,14 @@ char * __init efi_md_typeattr_format(char *buf, size_t size,
 	attr = md->attribute;
 	if (attr & ~(EFI_MEMORY_UC | EFI_MEMORY_WC | EFI_MEMORY_WT |
 		     EFI_MEMORY_WB | EFI_MEMORY_UCE | EFI_MEMORY_WP |
-		     EFI_MEMORY_RP | EFI_MEMORY_XP | EFI_MEMORY_RUNTIME))
+		     EFI_MEMORY_RP | EFI_MEMORY_XP | EFI_MEMORY_RUNTIME |
+		     EFI_MEMORY_MORE_RELIABLE))
 		snprintf(pos, size, "|attr=0x%016llx]",
 			 (unsigned long long)attr);
 	else
-		snprintf(pos, size, "|%3s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
+		snprintf(pos, size, "|%3s|%4s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
 			 attr & EFI_MEMORY_RUNTIME ? "RUN" : "",
+			 attr & EFI_MEMORY_MORE_RELIABLE ? "RELY" : "",
 			 attr & EFI_MEMORY_XP ? "XP" : "",
 			 attr & EFI_MEMORY_RP ? "RP" : "",
 			 attr & EFI_MEMORY_WP ? "WP" : "",
-- 
1.8.3.1
[PATCH 2/2] x86, efi: Add "efi_fake_mem_mirror" boot option
This patch introduces new boot option named "efi_fake_mem_mirror".
By specifying this parameter, you can mark specific memory as
mirrored memory. This is useful for debugging of Address Range
Mirroring feature.

For example, if you specify "efi_fake_mem_mirror=2G@4G,2G@0x10a000",
the original (firmware provided) EFI memmap will be updated so that
the specified memory regions have EFI_MEMORY_MORE_RELIABLE attribute:

original EFI memmap
efi: mem00: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x-0x1000) (0MB)
efi: mem01: [Loader Data        |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x1000-0x2000) (0MB)
...
efi: mem35: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x47ee6000-0x48014000) (1MB)
efi: mem36: [Conventional Memory|   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0001-0x0020a000) (129536MB)
efi: mem37: [Reserved           |RUN|    |  |  |  |   |  |  |  |UC] range=[0x6000-0x9000) (768MB)

updated EFI memmap
efi: mem00: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x-0x1000) (0MB)
efi: mem01: [Loader Data        |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x1000-0x2000) (0MB)
...
efi: mem35: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x47ee6000-0x48014000) (1MB)
efi: mem36: [Conventional Memory|   |RELY|  |  |  |   |WB|WT|WC|UC] range=[0x0001-0x00018000) (2048MB)
efi: mem37: [Conventional Memory|   |    |  |  |  |   |WB|WT|WC|UC] range=[0x00018000-0x0010a000) (61952MB)
efi: mem38: [Conventional Memory|   |RELY|  |  |  |   |WB|WT|WC|UC] range=[0x0010a000-0x00112000) (2048MB)
efi: mem39: [Conventional Memory|   |    |  |  |  |   |WB|WT|WC|UC] range=[0x00112000-0x0020a000) (63488MB)
efi: mem40: [Reserved           |RUN|    |  |  |  |   |  |  |  |UC] range=[0x6000-0x9000) (768MB)

And you will find that the following message is output:

efi: Memory: 4096M/131455M mirrored memory

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt | 8 ++
 arch/x86/include/asm/efi.h | 2 +
 arch/x86/kernel/setup.c | 4 +-
 arch/x86/platform/efi/efi.c | 2 +-
 arch/x86/platform/efi/quirks.c | 169
 5 files changed, 183 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f045..0efded6 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1092,6 +1092,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			you are really sure that your UEFI does sane gc and
 			fulfills the spec otherwise your board may brick.
 
+	efi_fake_mem_mirror=nn[KMG]@ss[KMG][,nn[KMG]@ss[KMG],..] [EFI; X86]
+			Mark specific memory as mirrored memory and update
+			EFI memory map.
+			Region of memory to be marked is from ss to ss+nn.
+			Using this parameter you can do debugging of Address
+			Range Mirroring feature even if your box doesn't support
+			it.
+
 	eisa_irq_edge=	[PARISC,HW]
 			See header of drivers/parisc/eisa.c.
 
diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 155162e..50e53cc 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -93,6 +93,7 @@ extern void __init efi_set_executable(efi_memory_desc_t *md, bool executable);
 extern int __init efi_memblock_x86_reserve_range(void);
 extern pgd_t * __init efi_call_phys_prolog(void);
 extern void __init efi_call_phys_epilog(pgd_t *save_pgd);
+extern void __init print_efi_memmap(void);
 extern void __init efi_unmap_memmap(void);
 extern void __init efi_memory_uc(u64 addr, unsigned long size);
 extern void __init efi_map_region(efi_memory_desc_t *md);
@@ -107,6 +108,7 @@ extern void __init efi_dump_pagetable(void);
 extern void __init efi_apply_memmap_quirks(void);
 extern int __init efi_reuse_config(u64 tables, int nr_tables);
 extern void efi_delete_dummy_variable(void);
+extern void __init efi_fake_memmap(void);
 
 struct efi_setup_data {
 	u64 fw_vendor;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 80f874b..e3ed628 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1104,8 +1104,10 @@ void __init setup_arch(char **cmdline_p)
 	memblock_set_current_limit(ISA_END_ADDRESS);
 	memblock_x86_fill();
 
-	if (efi_enabled(EFI_BOOT))
+	if (efi_enabled(EFI_BOOT)) {
+		efi_fake_memmap();
 		efi_find_mirror();
+	}
 
 	/*
 	 * The EF
[PATCH v2] perf: Fix multi-segment problem of perf_event_intel_uncore
In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0.

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case the relation between bus number and physical id may be broken,
because "uncore_pcibus_to_physid" doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing a segment-aware pci2phy_map
instead.

v1->v2:
 - Extract a function named uncore_pcibus_to_physid to avoid repetition of
   the phys_id retrieval code

Signed-off-by: Taku Izumi
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c | 25 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h | 11 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c | 23 +-
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 53 --
 4 files changed, 94 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 21b5e38..0ed6f2b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 
 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,23 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
 	EVENT_CONSTRAINT(0, 0, 0);
 
+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+	int phys_id = -1;
+	struct pci2phy_map *map;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(bus)) {
+			phys_id = map->pbus_to_physid[bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
+
+	return phys_id;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +827,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
 	int phys_id;
 	bool first_box = false;
 
-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	if (phys_id < 0)
 		return -ENODEV;
 
@@ -856,9 +874,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
 	struct intel_uncore_box *box = pci_get_drvdata(pdev);
 	struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id;
 	bool last_box = false;
 
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	box = pci_get_drvdata(pdev);
 	if (!box) {
 		for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 0f77f0a..6c96ee9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -117,6 +117,14 @@ struct uncore_event_desc {
 	const char *config;
 };
 
+struct pci2phy_map {
+	struct list_head list;
+	int segment;
+	int pbus_to_physid[256];
+};
+
+int uncore_pcibus_to_physid(struct pci_bus *bus);
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf);
 
@@ -317,7 +325,8 @@ u64 uncore_shared_reg_config(struct intel_uncore_box *box, int idx);
 extern struct intel_uncore_type **uncore_msr_uncores;
 extern struct intel_uncore_type **uncore_pci_uncores;
 extern struct pci_driver *uncore_pci_driver;
-extern int uncore_pcibus_to_physid[256];
+extern raw_spinlock_t pci2phy_map_lock;
+extern struct list_head pci2phy_map_head;
 extern struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 extern struct event_constraint uncore_constraint_empty;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c b/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c
index b005a78..ccbc817 100644
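The lookup introduced by this patch can be sketched in userspace roughly as follows. This is only an illustration under stated assumptions, not the kernel code: the names pci2phy_map_sketch, map_get, map_set and map_lookup are hypothetical, a plain singly linked list stands in for struct list_head, and the raw spinlock is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the patch's struct pci2phy_map. */
struct pci2phy_map_sketch {
	struct pci2phy_map_sketch *next; /* simple list link, not list_head */
	int segment;                     /* PCI segment (domain) number */
	int pbus_to_physid[256];         /* bus number -> physical id, -1 if unset */
};

/* Find the map for a segment; allocate one with all entries -1 if absent. */
static struct pci2phy_map_sketch *
map_get(struct pci2phy_map_sketch **head, int segment)
{
	struct pci2phy_map_sketch *map;
	int i;

	for (map = *head; map; map = map->next)
		if (map->segment == segment)
			return map;

	map = malloc(sizeof(*map));
	if (!map)
		return NULL;
	map->segment = segment;
	for (i = 0; i < 256; i++)
		map->pbus_to_physid[i] = -1;
	map->next = *head;
	*head = map;
	return map;
}

/* Record that (segment, bus) belongs to phys_id; returns 0 on success. */
static int map_set(struct pci2phy_map_sketch **head, int segment,
		   uint8_t bus, int phys_id)
{
	struct pci2phy_map_sketch *map;

	assert(head != NULL);
	map = map_get(head, segment);
	if (!map)
		return -1;
	map->pbus_to_physid[bus] = phys_id;
	return 0;
}

/* Same shape as uncore_pcibus_to_physid(): -1 when nothing is known. */
static int map_lookup(const struct pci2phy_map_sketch *head, int segment,
		      uint8_t bus)
{
	const struct pci2phy_map_sketch *map;

	for (map = head; map; map = map->next)
		if (map->segment == segment)
			return map->pbus_to_physid[bus];
	return -1;
}
```

With a flat 256-entry array, buses 0000:ff and 0001:ff would share one slot. Here, storing made-up physical ids for segment 0 and segment 1 at bus 0xff keeps them apart, and a lookup for a segment that was never registered returns -1.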
[PATCH] perf: Fix multi-segment problem of perf_event_intel_uncore
In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0.

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case the relation between bus number and physical id may be broken,
because "uncore_pcibus_to_physid" doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing a segment-aware pci2phy_map
instead.

Signed-off-by: Taku Izumi
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c | 27 +++---
 arch/x86/kernel/cpu/perf_event_intel_uncore.h | 9 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c | 23 +++-
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 41 ++
 4 files changed, 87 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 21b5e38..78c8686 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 
 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -806,10 +807,18 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
 	struct intel_uncore_pmu *pmu;
 	struct intel_uncore_box *box;
 	struct intel_uncore_type *type;
-	int phys_id;
+	int phys_id = -1;
 	bool first_box = false;
+	struct pci2phy_map *map;
 
-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(pdev->bus)) {
+			phys_id = map->pbus_to_physid[pdev->bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
 	if (phys_id < 0)
 		return -ENODEV;
 
@@ -856,8 +865,18 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
 	struct intel_uncore_box *box = pci_get_drvdata(pdev);
 	struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id = -1;
 	bool last_box = false;
+	struct pci2phy_map *map;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(pdev->bus)) {
+			phys_id = map->pbus_to_physid[pdev->bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
 
 	box = pci_get_drvdata(pdev);
 	if (!box) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 0f77f0a..0fb2a23 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -117,6 +117,12 @@ struct uncore_event_desc {
 	const char *config;
 };
 
+struct pci2phy_map {
+	struct list_head list;
+	int segment;
+	int pbus_to_physid[256];
+};
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf);
 
@@ -317,7 +323,8 @@ u64 uncore_shared_reg_config(struct intel_uncore_box *box, int idx);
 extern struct intel_uncore_type **uncore_msr_uncores;
 extern struct intel_uncore_type **uncore_pci_uncores;
 extern struct pci_driver *uncore_pci_driver;
-extern int uncore_pcibus_to_physid[256];
+extern raw_spinlock_t pci2phy_map_lock;
+extern struct list_head pci2phy_map_head;
 extern struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 extern struct event_constraint uncore_constraint_empty;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c b/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c
index b005a78..ccbc817 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c
@@ -402,14 +402,35 @@ static int snb_pci2phy_map_init(int devid)
 {
RE: [RFC PATCH 0/2 shit_A shit_B] workqueue: fix wq_numa bug
> This patches are un-changloged, un-compiled, un-booted, un-tested,
> they are just shits, I even hope them un-sent or blocked.
>
> The patches include two -solutions-:
>
> Shit_A:
>   workqueue: reset pool->node and unhash the pool when the node is
>     offline
>   update wq_numa when cpu_present_mask changed
>
>  kernel/workqueue.c | 107 +
>  1 file changed, 84 insertions(+), 23 deletions(-)
>
>
> Shit_B:
>   workqueue: reset pool->node and unhash the pool when the node is
>     offline
>   workqueue: remove wq_numa_possible_cpumask
>   workqueue: directly update attrs of pools when cpu hot[un]plug
>
>  kernel/workqueue.c | 135 +++--
>  1 file changed, 101 insertions(+), 34 deletions(-)
>

I tried your patchsets.

linux-3.18.3 + Shit_A:

   Build OK.
   I tried to reproduce the problem that Ishimatsu had reported, but it doesn't occur.
   It seems that your patch fixes this problem.

linux-3.18.3 + Shit_B:

   Build OK, but I encountered a kernel panic at boot time.

[0.189000] BUG: unable to handle kernel NULL pointer dereference at 0008
[0.189000] IP: [8131ef96] __list_add+0x16/0xc0
[0.189000] PGD 0
[0.189000] Oops: [#1] SMP
[0.189000] Modules linked in:
[0.189000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.18.3+ #3
[0.189000] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.81 12/03/2014
[0.189000] task: 880869678000 ti: 880869664000 task.ti: 880869664000
[0.189000] RIP: 0010:[8131ef96] [8131ef96] __list_add+0x16/0xc0
[0.189000] RSP: :880869667be8 EFLAGS: 00010296
[0.189000] RAX: 88087f83cda8 RBX: 88087f83cd80 RCX:
[0.189000] RDX: RSI: 88086912bb98 RDI: 88087f83cd80
[0.189000] RBP: 880869667c08 R08: R09: 88087f807480
[0.189000] R10: 810911b6 R11: 810956ac R12:
[0.189000] R13: 88086912bb98 R14: 0400 R15: 0400
[0.189000] FS: () GS:88087fc0() knlGS:
[0.189000] CS: 0010 DS: ES: CR0: 80050033
[0.189000] CR2: 0008 CR3: 01998000 CR4: 001407f0
[0.189000] Stack:
[0.189000] 000a 88086912b800 88087f83cd00 88087f80c000
[0.189000] 880869667c48 810912c8 880869667c28 88087f803f00
[0.189000] fff4 88086964b760 88086964b6a0 88086964b740
[0.189000] Call Trace:
[0.189000] [810912c8] alloc_unbound_pwq+0x298/0x3b0
[0.189000] [81091ce8] apply_workqueue_attrs+0x158/0x4c0
[0.189000] [81092424] __alloc_workqueue_key+0x174/0x5b0
[0.189000] [813052a6] ? alloc_cpumask_var_node+0x56/0x80
[0.189000] [81b21573] init_workqueues+0x33d/0x40f
[0.189000] [81b21236] ? ftrace_define_fields_workqueue_execute_start+0x6a/0x6a
[0.189000] [81002144] do_one_initcall+0xd4/0x210
[0.189000] [81b12f4d] ? native_smp_prepare_cpus+0x34d/0x352
[0.189000] [81b0026d] kernel_init_freeable+0xf5/0x23c
[0.189000] [81653370] ? rest_init+0x80/0x80
[0.189000] [8165337e] kernel_init+0xe/0xf0
[0.189000] [8166bcfc] ret_from_fork+0x7c/0xb0
[0.189000] [81653370] ? rest_init+0x80/0x80
[0.189000] Code: ff b8 f4 ff ff ff e9 3b ff ff ff b8 f4 ff ff ff e9 31 ff ff ff 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 48 89 fb 48 83 ec 08 <4c> 8b 42 08 49 39 f0 75 2e 4d 8b 45 00 4d 39 c4 75 6c 4c 39 e3
[0.189000] RIP [8131ef96] __list_add+0x16/0xc0
[0.189000] RSP 880869667be8
[0.189000] CR2: 0008
[0.189000] ---[ end trace 58feee6875cf67cf ]---
[0.189000] Kernel panic - not syncing: Fatal exception
[0.189000] ---[ end Kernel panic - not syncing: Fatal exception

Sincerely,
Taku Izumi

> Both patch1 of the both solutions are: reset pool->node and unhash the pool,
> it is suggested by TJ, I found it is a good leading-step for fixing the bug.
>
> The other patches are handling wq_numa_possible_cpumask where the solutions
> diverge.
>
> Solution_A uses present_mask rather than possible_cpumask. It adds
> wq_numa_notify_cpu_present_set/cleared() for notifications of
> the changes of cpu_present_mask. But the notifications are un-existed
> right now, so I fake one (wq_numa_check_present_cpumask_changes())
> to imitate them. I hope the memory people add a real one.
>
> Solution_B uses online_mask rather than possible_cpumask.
> this solution remove more coupling between numa_code and workqueue,
> it just depends on cpumask_of_node(node).
>
> Patch2_of_Solution_B removes the wq_numa_possible_cpumask and add
> overhead when cpu hot[un]plug, Patch3 reduce this overhead.
>
> Thanks,
> Lai
>
>
> Reported-by: Yasuaki Ishimatsu
> Cc: Tejun Heo
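As a reading aid for the oops above: the faulting address 0008 together with an IP inside __list_add is what one would expect if a list_head pointer was NULL and its second member was read. The marked instruction bytes, 4c 8b 42 08, decode as a load from offset 8 of %rdx. The sketch below only demonstrates the offset arithmetic in userspace. It assumes the usual two-pointer layout of struct list_head; list_head_sketch and prev_access_addr are hypothetical names, and it does not identify which pointer in the Shit_B patch was actually NULL.

```c
#include <stddef.h>

/* Userspace copy of the two-pointer layout of the kernel's struct list_head. */
struct list_head_sketch {
	struct list_head_sketch *next;
	struct list_head_sketch *prev;
};

/* Address that would be touched when reading ->prev through pointer value p. */
static size_t prev_access_addr(size_t p)
{
	return p + offsetof(struct list_head_sketch, prev);
}
```

For p equal to 0 the result is the size of one pointer, which is 8 on x86-64 and matches the CR2 value in the log.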
Re: [PATCH 2/3 v2] Do not use acpi_device to find pci root bridge in _init code.
On Fri, 12 Oct 2012 20:34:20 +0800
Tang Chen wrote:

> When the kernel is being initialized, and some hardwares are not added
> to system, there won't be acpi_device structs for these devices. But
> acpi_is_root_bridge() depends on acpi_device struct. As a result, all
> the not-added root bridge will not be judged as a root bridge in
> find_root_bridges(). And further more, no handle_hotplug_event_root()
> notifier will be installed for them.
>
> This patch introduces a new api to find all root bridges in system by
> getting HID directly from ACPI namespace, not depending on acpi_device
> struct.

  How about squashing patch #2 into patch #1?
  In my mind, the caller and the callee should be introduced in the same place.

Best regards,
Taku Izumi

> Signed-off-by: Tang Chen
> Signed-off-by: Liu Jiang
> ---
>  drivers/acpi/pci_root.c | 19 +++
>  1 files changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
> index 6151d83..582eb11 100644
> --- a/drivers/acpi/pci_root.c
> +++ b/drivers/acpi/pci_root.c
> @@ -129,20 +129,23 @@ EXPORT_SYMBOL_GPL(acpi_get_pci_rootbridge_handle);
>   * acpi_is_root_bridge - determine whether an ACPI CA node is a PCI root bridge
>   * @handle - the ACPI CA node in question.
>   *
> - * Note: we could make this API take a struct acpi_device * instead, but
> - * for now, it's more convenient to operate on an acpi_handle.
> + * Note: If a device is not added to the system yet, there won't be an
> + * acpi_device struct for it. So do not get HID and CID from acpi_device,
> + * get them from ACPI namespace directly.
>   */
>  int acpi_is_root_bridge(acpi_handle handle)
>  {
> -	int ret;
> -	struct acpi_device *device;
> +	struct acpi_device_info *info;
> +	acpi_status status;
>
> -	ret = acpi_bus_get_device(handle, &device);
> -	if (ret)
> +	status = acpi_get_object_info(handle, &info);
> +	if (ACPI_FAILURE(status)) {
> +		printk(KERN_ERR PREFIX "%s: Error reading"
> +			"device info\n", __func__);
>  		return 0;
> +	}
>
> -	ret = acpi_match_device_ids(device, root_device_ids);
> -	if (ret)
> +	if (acpi_match_object_info_ids(info, root_device_ids))
>  		return 0;
>  	else
>  		return 1;
> --
> 1.7.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
Taku Izumi
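The idea in the quoted patch, deciding whether an object is a root bridge from the HID and CID strings alone, can be illustrated with a small userspace sketch. Everything here is an assumption for illustration: object_ids_sketch, root_ids_sketch and is_root_bridge_sketch are hypothetical names, the fixed array of four CIDs is an arbitrary simplification, and the real acpi_device_info layout is not reproduced. Note also that this sketch returns 1 on a match, whereas the quoted code treats a non-zero result from acpi_match_object_info_ids() as "no match".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the IDs read from the ACPI namespace. */
struct object_ids_sketch {
	const char *hid;     /* hardware ID, may be NULL */
	const char *cids[4]; /* compatible IDs, NULL-terminated if fewer than 4 */
};

/* IDs treated as a PCI root bridge in this sketch. */
static const char *const root_ids_sketch[] = { "PNP0A03", NULL };

/* Return 1 if the HID or any CID appears in the table, else 0. */
static int is_root_bridge_sketch(const struct object_ids_sketch *ids)
{
	size_t i, j;

	for (i = 0; root_ids_sketch[i]; i++) {
		if (ids->hid && strcmp(ids->hid, root_ids_sketch[i]) == 0)
			return 1;
		for (j = 0; j < 4 && ids->cids[j]; j++)
			if (strcmp(ids->cids[j], root_ids_sketch[i]) == 0)
				return 1;
	}
	return 0;
}
```

Because only strings are compared, no acpi_device object has to exist yet, which is the point the patch description makes about devices that have not been added to the system.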
Re: [RFC PATCH 2/3] ACPIHP: ACPI system device hotplug slot enumerator
andled by acpiphp or pciehp drivers.
> +	 */
> +	if (type == ACPIHP_DEV_TYPE_HOST_BRIDGE)
> +		return AE_CTRL_DEPTH;
> +
> +	return AE_OK;
> +}
> +
> +/*
> + * Get types of child devices connected to this slot.
> + * We only care about CPU, memory, PCI host bridge and CONTAINER here.
> + * Values used here must be in consistence with acpihp_enum_get_slot_type().
> + */
> +static acpi_status __init
> +acpihp_enum_get_dev_type(acpi_handle handle, u32 lvl, void *context, void **rv)
> +{
> +	acpi_status status = AE_OK;
> +	enum acpihp_dev_type type;
> +	u32 *tp = (u32 *)rv;
> +
> +	if (!acpihp_dev_get_type(handle, &type)) {
> +		switch (type) {
> +		case ACPIHP_DEV_TYPE_CPU:
> +			*tp |= 0x0001;
> +			status = AE_CTRL_DEPTH;
> +			break;
> +		case ACPIHP_DEV_TYPE_MEM:
> +			*tp |= 0x0002;
> +			status = AE_CTRL_DEPTH;
> +			break;
> +		case ACPIHP_DEV_TYPE_HOST_BRIDGE:
> +			*tp |= 0x0004;
> +			status = AE_CTRL_DEPTH;
> +			break;
> +		case ACPIHP_DEV_TYPE_CONTAINER:
> +			*tp |= 0x0008;
> +			break;
> +		default:
> +			break;
> +		}
> +	}
> +
> +	return status;
> +}
> +
> +/*
> + * Guess type of a hotplug slot according to child devices connecting to it.
> + */
> +static enum acpihp_slot_type __init acpihp_enum_get_slot_type(u32 dev_types)
> +{
> +	BUG_ON(dev_types > 15);
> +
> +	switch (dev_types) {
> +	case 0:
> +		/* Generic CONTAINER */
> +		return ACPIHP_SLOT_TYPE_COMMON;
> +	case 1:
> +		/* Physical processor with logical CPUs */
> +		return ACPIHP_SLOT_TYPE_CPU;
> +	case 2:
> +		/* Memory board/box with memory devices */
> +		return ACPIHP_SLOT_TYPE_MEM;
> +	case 3:
> +		/* Physical processor with CPUs and memory controllers */
> +		return ACPIHP_SLOT_TYPE_CPU;
> +	case 4:
> +		/* IO eXtension board/box with IO host bridges */
> +		return ACPIHP_SLOT_TYPE_IOX;
> +	case 7:
> +		/* Physical processor with CPUs, IO host bridges and MCs. */
> +		return ACPIHP_SLOT_TYPE_CPU;

  Why is this case ACPIHP_SLOT_TYPE_CPU?
  I think it should be ACPIHP_SLOT_TYPE_COMMON or something else.

  By the way, how about simplifying the slot type categories?
  Do we need to differentiate cases 7, 8, 9, 11 and 15?

Best regards,
Taku Izumi

> +	case 8:
> +		/* Generic CONTAINER */
> +		return ACPIHP_SLOT_TYPE_COMMON;
> +	case 9:
> +		/* System board with physical processors */
> +		return ACPIHP_SLOT_TYPE_SYSTEM_BOARD;
> +	case 11:
> +		/* System board with physical processors and memory */
> +		return ACPIHP_SLOT_TYPE_SYSTEM_BOARD;
> +	case 15:
> +		/* Node with processor, memory and IO host bridge */
> +		return ACPIHP_SLOT_TYPE_NODE;
> +	default:
> +		return ACPIHP_SLOT_TYPE_UNKNOWN;
> +	}
> +}
> +
> +/*
> + * Guess type of a hotplug slot according to the device type of the
> + * corresponding ACPI object itself.
> + */
> +static enum acpihp_slot_type __init
> +acpihp_enum_check_slot_self(struct acpihp_slot *slot)
> +{
> +	enum acpihp_dev_type type;
> +
> +	if (acpihp_dev_get_type(slot->handle, &type))
> +		return ACPIHP_SLOT_TYPE_UNKNOWN;
> +
> +	switch (type) {
> +	case ACPIHP_DEV_TYPE_CPU:
> +		/* Logical CPU used in virtualization environment */
> +		return ACPIHP_SLOT_TYPE_CPU;
> +	case ACPIHP_DEV_TYPE_MEM:
> +		/* Memory board with single memory device */
> +		return ACPIHP_SLOT_TYPE_MEM;
> +	case ACPIHP_DEV_TYPE_HOST_BRIDGE:
> +		/* IO eXtension board/box with single IO host bridge */
> +		return ACPIHP_SLOT_TYPE_IOX;
> +	default:
> +		return ACPIHP_SLOT_TYPE_UNKNOWN;
> +	}
> +}
> +
> +static int __init acpihp_enum_generate_slot_name(struct acpihp_slot *slot)
> +{
> +	int found = 0;
> +	struct list_head *list;
> +	struct acpihp_slot_id *slot_id;
> +	unsigned long long uid;
> +
> +	/* Respect firmware settings if _UID return an integer. */
> +	if (ACPI_SUCCESS(acpi_evaluate_integer(slot->handle, METHOD_NAME__UID,
> +					       NULL, &uid)))
Re: [RFC PATCH 2/3] ACPIHP: ACPI system device hotplug slot enumerator
;
> +			break;
> +		case ACPIHP_DEV_TYPE_HOST_BRIDGE:
> +			*tp |= 0x0004;
> +			status = AE_CTRL_DEPTH;
> +			break;
> +		case ACPIHP_DEV_TYPE_CONTAINER:
> +			*tp |= 0x0008;
> +			break;
> +		default:
> +			break;
> +		}
> +	}
> +
> +	return status;
> +}
> +
> +/*
> + * Guess type of a hotplug slot according to child devices connecting to it.
> + */
> +static enum acpihp_slot_type __init acpihp_enum_get_slot_type(u32 dev_types)
> +{
> +	BUG_ON(dev_types > 15);
> +
> +	switch (dev_types) {
> +	case 0:
> +		/* Generic CONTAINER */
> +		return ACPIHP_SLOT_TYPE_COMMON;
> +	case 1:
> +		/* Physical processor with logical CPUs */
> +		return ACPIHP_SLOT_TYPE_CPU;
> +	case 2:
> +		/* Memory board/box with memory devices */
> +		return ACPIHP_SLOT_TYPE_MEM;
> +	case 3:
> +		/* Physical processor with CPUs and memory controllers */
> +		return ACPIHP_SLOT_TYPE_CPU;
> +	case 4:
> +		/* IO eXtension board/box with IO host bridges */
> +		return ACPIHP_SLOT_TYPE_IOX;
> +	case 7:
> +		/* Physical processor with CPUs, IO host bridges and MCs. */
> +		return ACPIHP_SLOT_TYPE_CPU;

Why is this case ACPIHP_SLOT_TYPE_CPU? I think this case is ACPIHP_SLOT_TYPE_COMMON or else.

By the way how about simplifying slot type category? Do we need to differentiate case7, 8, 9, 11 and 15?

Best regards,
Taku Izumi

> +	case 8:
> +		/* Generic CONTAINER */
> +		return ACPIHP_SLOT_TYPE_COMMON;
> +	case 9:
> +		/* System board with physical processors */
> +		return ACPIHP_SLOT_TYPE_SYSTEM_BOARD;
> +	case 11:
> +		/* System board with physical processors and memory */
> +		return ACPIHP_SLOT_TYPE_SYSTEM_BOARD;
> +	case 15:
> +		/* Node with processor, memory and IO host bridge */
> +		return ACPIHP_SLOT_TYPE_NODE;
> +	default:
> +		return ACPIHP_SLOT_TYPE_UNKNOWN;
> +	}
> +}
> +
> +/*
> + * Guess type of a hotplug slot according to the device type of the
> + * corresponding ACPI object itself.
> + */
> +static enum acpihp_slot_type __init
> +acpihp_enum_check_slot_self(struct acpihp_slot *slot)
> +{
> +	enum acpihp_dev_type type;
> +
> +	if (acpihp_dev_get_type(slot->handle, &type))
> +		return ACPIHP_SLOT_TYPE_UNKNOWN;
> +
> +	switch (type) {
> +	case ACPIHP_DEV_TYPE_CPU:
> +		/* Logical CPU used in virtualization environment */
> +		return ACPIHP_SLOT_TYPE_CPU;
> +	case ACPIHP_DEV_TYPE_MEM:
> +		/* Memory board with single memory device */
> +		return ACPIHP_SLOT_TYPE_MEM;
> +	case ACPIHP_DEV_TYPE_HOST_BRIDGE:
> +		/* IO eXtension board/box with single IO host bridge */
> +		return ACPIHP_SLOT_TYPE_IOX;
> +	default:
> +		return ACPIHP_SLOT_TYPE_UNKNOWN;
> +	}
> +}
> +
> +static int __init acpihp_enum_generate_slot_name(struct acpihp_slot *slot)
> +{
> +	int found = 0;
> +	struct list_head *list;
> +	struct acpihp_slot_id *slot_id;
> +	unsigned long long uid;
> +
> +	/* Respect firmware settings if _UID return an integer. */
> +	if (ACPI_SUCCESS(acpi_evaluate_integer(slot->handle, METHOD_NAME__UID,
> +					       NULL, &uid)))
> +		goto set_name;
> +
> +	if (slot->parent)
> +		list = &slot->parent->slot_id_list;
> +	else
> +		list = &slot_id_list;
> +
> +	list_for_each_entry(slot_id, list, node)
> +		if (slot_id->type == slot->type) {
> +			found = 1;
> +			break;
> +		}
> +	if (!found) {
> +		slot_id = kzalloc(sizeof(struct acpihp_slot_id), GFP_KERNEL);
> +		if (!slot_id) {
> +			ACPIHP_DEBUG("fails to allocate slot instance ID.\n");
> +			return -ENOMEM;
> +		}
> +		slot_id->type = slot->type;
> +		list_add_tail(&slot_id->node, list);
> +	}
> +
> +	uid = slot_id->instance_id++;
> +
> +set_name:
> +	snprintf(slot->name, sizeof(slot->name) - 1, "%s%02llx",
> +		 acpihp_get_slot_type_name(slot->type), uid);
> +	dev_set_name(&slot->dev, "%s", slot->name);
> +
> +	return 0;
> +}
> +
> +/*
> + * Generate a meaningful name for the slot according to devices connecting
> + * to this slot
> + */
> +static int __init acpihp_enum_rename_slot(struct acpihp_slot *slot)
> +{
> +	u32 child_types = 0;
> +
> +	slot->type = acpihp_enum_check_slot_self(slot);
> +	if (slot->type == ACPIHP_SLOT_TYPE_UNKNOWN) {
> +		acpi_walk_namespace(ACPI_TYPE_DEVICE, slot->handle,
> +				ACPI_UINT32_MAX, acpihp_enum_get_dev_type,
> +				NULL, NULL, (void **)&child_types);
> +		acpi_walk_namespace
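For readers following the review question about case 7, the mapping in the quoted patch can be modelled in user space. This is only a sketch under stated assumptions: the bit values 0x0004 (host bridge) and 0x0008 (container) appear in the patch, while 0x0001 (CPU) and 0x0002 (memory) are inferred from the case comments; the names `DEV_*`, `SLOT_*` and `classify_slot()` are hypothetical stand-ins, and `assert()` stands in for `BUG_ON()`.

```c
#include <assert.h>
#include <stdint.h>

/* Device-type bits collected while walking the child devices.
 * 0x4 and 0x8 are taken from the quoted patch; 0x1 and 0x2 are inferred. */
#define DEV_CPU		0x1u
#define DEV_MEM		0x2u
#define DEV_HOST_BRIDGE	0x4u
#define DEV_CONTAINER	0x8u

/* Hypothetical user-space stand-in for enum acpihp_slot_type. */
enum slot_type {
	SLOT_COMMON,
	SLOT_CPU,
	SLOT_MEM,
	SLOT_IOX,
	SLOT_SYSTEM_BOARD,
	SLOT_NODE,
	SLOT_UNKNOWN
};

/* Mirrors the switch in the quoted acpihp_enum_get_slot_type(). */
enum slot_type classify_slot(uint32_t dev_types)
{
	assert(dev_types <= 15);	/* stands in for BUG_ON(dev_types > 15) */

	switch (dev_types) {
	case 0:
		return SLOT_COMMON;
	case DEV_CPU:
		return SLOT_CPU;
	case DEV_MEM:
		return SLOT_MEM;
	case DEV_CPU | DEV_MEM:
		return SLOT_CPU;
	case DEV_HOST_BRIDGE:
		return SLOT_IOX;
	case DEV_CPU | DEV_MEM | DEV_HOST_BRIDGE:
		return SLOT_CPU;	/* case 7, the one questioned in review */
	case DEV_CONTAINER:
		return SLOT_COMMON;
	case DEV_CONTAINER | DEV_CPU:
		return SLOT_SYSTEM_BOARD;
	case DEV_CONTAINER | DEV_CPU | DEV_MEM:
		return SLOT_SYSTEM_BOARD;
	case DEV_CONTAINER | DEV_CPU | DEV_MEM | DEV_HOST_BRIDGE:
		return SLOT_NODE;
	default:
		return SLOT_UNKNOWN;
	}
}
```

Written with named bits, it is easier to see that only 10 of the 16 combinations are handled and that case 7 differs from case 15 only by the container bit, which is the distinction the review asks about.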
Re: [RFC PATCH 01/14] PCI: add pcie_flags into struct pci_dev to cache PCIe capabilities register
On Tue, 10 Jul 2012 23:54:02 +0800 Jiang Liu wrote:

> From: Yijing Wang
>
> From: Yijing Wang
>
> Since PCI Express Capabilities Register is read only, cache its value
> into struct pci_dev to avoid repeatedly calling pci_read_config_*().
>
> Signed-off-by: Yijing Wang
> Signed-off-by: Jiang Liu
> ---
>  drivers/pci/probe.c |    1 +
>  include/linux/pci.h |    1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 6c143b4..65e82e3 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -929,6 +929,7 @@ void set_pcie_port_type(struct pci_dev *pdev)
>  	pdev->is_pcie = 1;
>  	pdev->pcie_cap = pos;
>  	pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, &reg16);
> +	pdev->pcie_flags = reg16;
>  	pdev->pcie_type = (reg16 & PCI_EXP_FLAGS_TYPE) >> 4;
>  	pci_read_config_word(pdev, pos + PCI_EXP_DEVCAP, &reg16);
>  	pdev->pcie_mpss = reg16 & PCI_EXP_DEVCAP_PAYLOAD;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 5faa831..f4a7ad6 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -258,6 +258,7 @@ struct pci_dev {
>  	u8		pcie_mpss:3;	/* PCI-E Max Payload Size Supported */
>  	u8		rom_base_reg;	/* which config register controls the ROM */
>  	u8		pin;		/* which interrupt pin this device uses */
> +	u16		pcie_flags;	/* cached PCI-E Capabilities Register */

"xxx_flags" sounds like a bit flag. This variable stores the value of the PCIe Capabilities Register, doesn't it? How about "pcie_cap_reg"?

>
>  	struct pci_driver *driver;	/* which driver has allocated this device */
>  	u64		dma_mask;	/* Mask of the bits of bus address this
> --
> 1.7.9.5

--
Taku Izumi
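To illustrate why caching the register is useful, here is a minimal user-space sketch of decoding fields from a cached copy instead of re-reading config space. The mask values match the `PCI_EXP_FLAGS_*` definitions in the kernel's `pci_regs.h`; the struct `fake_pci_dev` and the three helper functions are hypothetical names invented for this example, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Field masks of the PCI Express Capabilities Register. */
#define PCI_EXP_FLAGS_VERS	0x000f	/* Capability version */
#define PCI_EXP_FLAGS_TYPE	0x00f0	/* Device/Port type */
#define PCI_EXP_FLAGS_SLOT	0x0100	/* Slot implemented */

/* Hypothetical stand-in for the one field the patch adds to struct pci_dev. */
struct fake_pci_dev {
	uint16_t pcie_flags;	/* cached PCI-E Capabilities Register */
};

/* Each helper reads the cached value; no config-space access is needed. */
unsigned int cached_pcie_version(const struct fake_pci_dev *dev)
{
	return dev->pcie_flags & PCI_EXP_FLAGS_VERS;
}

unsigned int cached_pcie_type(const struct fake_pci_dev *dev)
{
	return (dev->pcie_flags & PCI_EXP_FLAGS_TYPE) >> 4;
}

int cached_pcie_has_slot(const struct fake_pci_dev *dev)
{
	return (dev->pcie_flags & PCI_EXP_FLAGS_SLOT) != 0;
}
```

The helpers also show the point raised in the review: the cached value is a whole register with several multi-bit fields, not a set of independent boolean flags.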
Re: [PATCH][BUG] Fix the graphic corruption issue on IA64 machines
Hi,

As a result of the discussion with Pete Zaitcev ([EMAIL PROTECTED]), I re-created the patch. The attached patch is the revised version.

He pointed out that the former patch may violate existing assumptions and was not safe. Concretely, he was concerned that an unexpected problem may arise somewhere if "blank_state", which is intended to reflect the state of the timer, was shuffled around. This revised patch reflects his comments. I confirmed that it also fixes the problem.

Regards,
Taku Izumi <[EMAIL PROTECTED]>

Fix the graphic corruption issue on IA64 machines.

The VGA console driver can misunderstand the current mode (Text/Graphic) under the "disable console blanking" setting. When "disable console blank" is set (blankinterval=0), the "do_unblank_screen()" function returns without changing "blank_state", and when "blank_state" is "blank_off", the "do_blank_screen()" function returns without invoking the sw->con_blank() function. That's why the VGA console driver can misunderstand the current mode.

Signed-off-by: Nobuhiro Tachino <[EMAIL PROTECTED]>
Signed-off-by: Pete Zaitcev <[EMAIL PROTECTED]>
Signed-off-by: Taku Izumi <[EMAIL PROTECTED]>
---
 drivers/char/vt.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

Index: linux-2.6.22/drivers/char/vt.c
===================================================================
--- linux-2.6.22.org/drivers/char/vt.c 2007-06-27 11:40:03.0 -0400
+++ linux-2.6.22/drivers/char/vt.c 2007-06-27 11:24:32.0 -0400
@@ -3491,9 +3491,6 @@ void do_blank_screen(int entering_gfx)
 		}
 		return;
 	}
-	if (blank_state != blank_normal_wait)
-		return;
-	blank_state = blank_off;
 
 	/* entering graphics mode? */
 	if (entering_gfx) {
@@ -3501,10 +3498,15 @@ void do_blank_screen(int entering_gfx)
 		save_screen(vc);
 		vc->vc_sw->con_blank(vc, -1, 1);
 		console_blanked = fg_console + 1;
+		blank_state = blank_off;
 		set_origin(vc);
 		return;
 	}
 
+	if (blank_state != blank_normal_wait)
+		return;
+	blank_state = blank_off;
+
 	/* don't blank graphics */
 	if (vc->vc_mode != KD_TEXT) {
 		console_blanked = fg_console + 1;
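The effect of moving the "blank_state" check can be sketched with a small user-space model. This is a simplified illustration, not the vt.c code: the struct `console_model` and both function names are hypothetical, only two of the states are modelled, and the call to sw->con_blank() is reduced to a counter.

```c
#include <assert.h>

/* Simplified model of the timer-related state kept in vt.c. */
enum blank_state_model { blank_off, blank_normal_wait };

/* Hypothetical stand-in for the console state touched by do_blank_screen(). */
struct console_model {
	enum blank_state_model blank_state;
	int con_blank_calls;	/* how often the driver's con_blank() ran */
};

/* Ordering before the patch: the state check runs first, so with
 * blank_state == blank_off the graphics-mode switch never reaches the driver. */
void blank_screen_before(struct console_model *c, int entering_gfx)
{
	if (c->blank_state != blank_normal_wait)
		return;
	c->blank_state = blank_off;

	if (entering_gfx) {
		c->con_blank_calls++;
		return;
	}
	c->con_blank_calls++;
}

/* Ordering after the patch: entering graphics mode is handled before the
 * state check, so the driver is always told about the mode change. */
void blank_screen_after(struct console_model *c, int entering_gfx)
{
	if (entering_gfx) {
		c->con_blank_calls++;
		c->blank_state = blank_off;
		return;
	}

	if (c->blank_state != blank_normal_wait)
		return;
	c->blank_state = blank_off;
	c->con_blank_calls++;
}
```

With blanking disabled the model starts in blank_off, which is the situation described in the commit message: the old ordering returns early, while the new ordering still notifies the driver when entering graphics mode and leaves ordinary timer-driven blanking unchanged.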