[PATCH v5 08/10] device-dax: Add a driver for "hmem" devices
Platform firmware like EFI/ACPI may publish "hmem" platform devices. Such a device is a performance differentiated memory range likely reserved for an application specific use case. The driver gives access to 100% of the capacity via a device-dax mmap instance by default. However, if over-subscription and other kernel memory management is desired the resulting dax device can be assigned to the core-mm via the kmem driver. This consumes "hmem" devices the producer of "hmem" devices is saved for a follow-on patch so that it can reference the new CONFIG_DEV_DAX_HMEM symbol to gate performing the enumeration work. Cc: Vishal Verma Cc: Keith Busch Cc: Dave Jiang Reported-by: kbuild test robot Reviewed-by: Dave Hansen Signed-off-by: Dan Williams --- drivers/dax/Kconfig | 27 + drivers/dax/Makefile |2 ++ drivers/dax/hmem.c| 57 + include/linux/memregion.h |4 +++ 4 files changed, 85 insertions(+), 5 deletions(-) create mode 100644 drivers/dax/hmem.c diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig index f33c73e4af41..3b6c06f07326 100644 --- a/drivers/dax/Kconfig +++ b/drivers/dax/Kconfig @@ -32,19 +32,36 @@ config DEV_DAX_PMEM Say M if unsure +config DEV_DAX_HMEM + tristate "HMEM DAX: direct access to 'specific purpose' memory" + depends on EFI_SOFT_RESERVE + default DEV_DAX + help + EFI 2.8 platforms, and others, may advertise 'specific purpose' + memory. For example, a high bandwidth memory pool. The + indication from platform firmware is meant to reserve the + memory from typical usage by default. This driver creates + device-dax instances for these memory ranges, and that also + enables the possibility to assign them to the DEV_DAX_KMEM + driver to override the reservation and add them to kernel + "System RAM" pool. + + Say M if unsure. + config DEV_DAX_KMEM tristate "KMEM DAX: volatile-use of persistent memory" default DEV_DAX depends on DEV_DAX depends on MEMORY_HOTPLUG # for add_memory() and friends help - Support access to persistent memory as if it were RAM. This - allows easier use of persistent memory by unmodified - applications. + Support access to persistent, or other performance + differentiated memory as if it were System RAM. This allows + easier use of persistent memory by unmodified applications, or + adds core kernel memory services to heterogeneous memory types + (HMEM) marked "reserved" by platform firmware. To use this feature, a DAX device must be unbound from the - device_dax driver (PMEM DAX) and bound to this kmem driver - on each boot. + device_dax driver and bound to this kmem driver on each boot. Say N if unsure. diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile index 81f7d54dadfb..80065b38b3c4 100644 --- a/drivers/dax/Makefile +++ b/drivers/dax/Makefile @@ -2,9 +2,11 @@ obj-$(CONFIG_DAX) += dax.o obj-$(CONFIG_DEV_DAX) += device_dax.o obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o +obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o dax-y := super.o dax-y += bus.o device_dax-y := device.o +dax_hmem-y := hmem.o obj-y += pmem/ diff --git a/drivers/dax/hmem.c b/drivers/dax/hmem.c new file mode 100644 index ..adbc548a0aef --- /dev/null +++ b/drivers/dax/hmem.c @@ -0,0 +1,57 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include "bus.h" + +static int dax_hmem_probe(struct platform_device *pdev) +{ + struct device *dev = >dev; + struct dev_pagemap pgmap = { }; + struct dax_region *dax_region; + struct memregion_info *mri; + struct dev_dax *dev_dax; + struct resource *res; + + res = platform_get_resource(pdev, IORESOURCE_MEM, 0); + if (!res) + return -ENOMEM; + + mri = dev->platform_data; + pgmap.dev = dev; + memcpy(, res, sizeof(*res)); + + dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node, + PMD_SIZE, PFN_DEV|PFN_MAP); + if (!dax_region) + return -ENOMEM; + + dev_dax = devm_create_dev_dax(dax_region, 0, ); + if (IS_ERR(dev_dax)) + return PTR_ERR(dev_dax); + + /* child dev_dax instances now own the lifetime of the dax_region */ + dax_region_put(dax_region); + return 0; +} + +static int dax_hmem_remove(struct platform_device *pdev) +{ + /* devm handles teardown */ + return 0; +} + +static struct platform_driver dax_hmem_driver = { + .probe = dax_hmem_probe, + .remove = dax_hmem_remove, + .driver = { + .name = "hmem", + }, +}; + +module_platform_driver(dax_hmem_driver); + +MODULE_ALIAS("platform:hmem*"); +MODULE_LICENSE("GPL v2");
[PATCH v5 09/10] acpi/numa/hmat: Register HMAT at device_initcall level
In preparation for registering device-dax instances for accessing EFI specific-purpose memory, arrange for the HMAT registration to occur later in the init process. Critically HMAT initialization needs to occur after e820__reserve_resources_late() which is the point at which the iomem resource tree is populated with "Application Reserved" (IORES_DESC_APPLICATION_RESERVED). e820__reserve_resources_late() happens at subsys_initcall time. Cc: "Rafael J. Wysocki" Cc: Len Brown Cc: Keith Busch Cc: Jonathan Cameron Reviewed-by: Dave Hansen Signed-off-by: Dan Williams --- drivers/acpi/numa/hmat.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c index 8f9a28a870b0..4707eb9dd07b 100644 --- a/drivers/acpi/numa/hmat.c +++ b/drivers/acpi/numa/hmat.c @@ -748,4 +748,4 @@ static __init int hmat_init(void) acpi_put_table(tbl); return 0; } -subsys_initcall(hmat_init); +device_initcall(hmat_init);
[PATCH v5 04/10] x86, efi: Reserve UEFI 2.8 Specific Purpose Memory for dax
UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the interpretation of the EFI Memory Types as "reserved for a specific purpose". The proposed Linux behavior for specific purpose memory is that it is reserved for direct-access (device-dax) by default and not available for any kernel usage, not even as an OOM fallback. Later, through udev scripts or another init mechanism, these device-dax claimed ranges can be reconfigured and hot-added to the available System-RAM with a unique node identifier. This device-dax management scheme implements "soft" in the "soft reserved" designation by allowing some or all of the reservation to be recovered as typical memory. This policy can be disabled at compile-time with CONFIG_EFI_SOFT_RESERVE=n, or runtime with efi=nosoftreserve. This patch introduces 2 new concepts at once given the entanglement between early boot enumeration relative to memory that can optionally be reserved from the kernel page allocator by default. The new concepts are: - E820_TYPE_SOFT_RESERVED: Upon detecting the EFI_MEMORY_SP attribute on EFI_CONVENTIONAL memory, update the E820 map with this new type. Only perform this classification if the CONFIG_EFI_SOFT_RESERVE=y policy is enabled, otherwise treat it as typical ram. - IORES_DESC_SOFT_RESERVED: Add a new I/O resource descriptor for a device driver to search iomem resources for application specific memory. Teach the iomem code to identify such ranges as "Soft Reserved". A follow-on change integrates parsing of the ACPI HMAT to identify the node and sub-range boundaries of EFI_MEMORY_SP designated memory. For now, just identify and reserve memory of this type. The translation of EFI_CONVENTIONAL_MEMORY + EFI_MEMORY_SP to "soft reserved" is x86/E820-only, but other archs could choose to publish IORES_DESC_SOFT_RESERVED resources from their platform-firmware memory map handlers. Other EFI-capable platforms would need to go audit their local usages of EFI_CONVENTIONAL_MEMORY to consider the soft reserved case. Cc: Cc: Borislav Petkov Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Darren Hart Cc: Andy Shevchenko Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ard Biesheuvel Reported-by: kbuild test robot Reviewed-by: Dave Hansen Signed-off-by: Dan Williams --- Documentation/admin-guide/kernel-parameters.txt | 19 +++-- arch/x86/Kconfig| 21 + arch/x86/boot/compressed/eboot.c|7 +++ arch/x86/boot/compressed/kaslr.c|4 ++ arch/x86/include/asm/e820/types.h |8 arch/x86/include/asm/efi-stub.h | 11 + arch/x86/kernel/e820.c | 12 + arch/x86/platform/efi/efi.c | 51 +-- drivers/firmware/efi/efi.c |3 + drivers/firmware/efi/libstub/efi-stub-helper.c | 12 + include/linux/efi.h |1 include/linux/ioport.h |1 12 files changed, 139 insertions(+), 11 deletions(-) create mode 100644 arch/x86/include/asm/efi-stub.h diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 1c67acd1df65..dd28f0726309 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1152,7 +1152,8 @@ Format: {"off" | "on" | "skip[mbr]"} efi=[EFI] - Format: { "old_map", "nochunk", "noruntime", "debug" } + Format: { "old_map", "nochunk", "noruntime", "debug", + "nosoftreserve" } old_map [X86-64]: switch to the old ioremap-based EFI runtime services mapping. 32-bit still uses this one by default. @@ -1161,6 +1162,12 @@ firmware implementations. noruntime : disable EFI runtime services support debug: enable misc debug output + nosoftreserve: The EFI_MEMORY_SP (Specific Purpose) + attribute may cause the kernel to reserve the + memory range for a memory mapping driver to + claim. Specify efi=nosoftreserve to disable this + reservation and treat the memory by its base type + (i.e. EFI_CONVENTIONAL_MEMORY / "System RAM"). efi_no_storage_paranoia [EFI; X86] Using this parameter you can use more than 50% of @@ -1173,15 +1180,21 @@ updating original EFI memory map. Region of memory which aa attribute is added to is from ss to ss+nn. + If efi_fake_mem=2G@4G:0x1,2G@0x10a000:0x1 is
[PATCH v5 07/10] dax: Fix alloc_dax_region() compile warning
PFN flags are (unsigned long long), fix the alloc_dax_region() calling convention to fix warnings of the form: >> include/linux/pfn_t.h:18:17: warning: large integer implicitly truncated to >> unsigned type [-Woverflow] #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3)) Reported-by: kbuild test robot Signed-off-by: Dan Williams --- drivers/dax/bus.c |2 +- drivers/dax/bus.h |2 +- drivers/dax/dax-private.h |2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 8fafbeab510a..eccdda1f7b71 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -227,7 +227,7 @@ static void dax_region_unregister(void *region) struct dax_region *alloc_dax_region(struct device *parent, int region_id, struct resource *res, int target_node, unsigned int align, - unsigned long pfn_flags) + unsigned long long pfn_flags) { struct dax_region *dax_region; diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h index 8619e3299943..9e4eba67e8b9 100644 --- a/drivers/dax/bus.h +++ b/drivers/dax/bus.h @@ -11,7 +11,7 @@ struct dax_region; void dax_region_put(struct dax_region *dax_region); struct dax_region *alloc_dax_region(struct device *parent, int region_id, struct resource *res, int target_node, unsigned int align, - unsigned long flags); + unsigned long long flags); enum dev_dax_subsys { DEV_DAX_BUS, diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index 6ccca3b890d6..3107ce80e809 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -32,7 +32,7 @@ struct dax_region { struct device *dev; unsigned int align; struct resource res; - unsigned long pfn_flags; + unsigned long long pfn_flags; }; /**
[PATCH v5 10/10] acpi/numa/hmat: Register "soft reserved" memory as an "hmem" device
Memory that has been tagged EFI_MEMORY_SP, and has performance properties described by the ACPI HMAT is expected to have an application specific consumer. Those consumers may want 100% of the memory capacity to be reserved from any usage by the kernel. By default, with this enabling, a platform device is created to represent this differentiated resource. The device-dax "hmem" driver claims these devices by default and provides an mmap interface for the target application. If the administrator prefers, the hmem resource range can be made available to the core-mm via the device-dax hotplug facility, kmem, to online the memory with its own numa node. This was tested with an emulated HMAT produced by qemu (with the pending HMAT enabling patches), and "efi_fake_mem=8G@9G:0x4" on the kernel command line to mark the memory ranges associated with node2 and node3 as EFI_MEMORY_SP. qemu numa configuration options: -numa node,mem=4G,cpus=0-19,nodeid=0 -numa node,mem=4G,cpus=20-39,nodeid=1 -numa node,mem=4G,nodeid=2 -numa node,mem=4G,nodeid=3 -numa dist,src=0,dst=0,val=10 -numa dist,src=0,dst=1,val=21 -numa dist,src=0,dst=2,val=21 -numa dist,src=0,dst=3,val=21 -numa dist,src=1,dst=0,val=21 -numa dist,src=1,dst=1,val=10 -numa dist,src=1,dst=2,val=21 -numa dist,src=1,dst=3,val=21 -numa dist,src=2,dst=0,val=21 -numa dist,src=2,dst=1,val=21 -numa dist,src=2,dst=2,val=10 -numa dist,src=2,dst=3,val=21 -numa dist,src=3,dst=0,val=21 -numa dist,src=3,dst=1,val=21 -numa dist,src=3,dst=2,val=21 -numa dist,src=3,dst=3,val=10 -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,base-lat=10,latency=5 -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,base-bw=20,bandwidth=5 -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,base-lat=10,latency=10 -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,base-bw=20,bandwidth=10 -numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-latency,base-lat=10,latency=15 -numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-bandwidth,base-bw=20,bandwidth=15 -numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-latency,base-lat=10,latency=20 -numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-bandwidth,base-bw=20,bandwidth=20 -numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-latency,base-lat=10,latency=10 -numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-bandwidth,base-bw=20,bandwidth=10 -numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-latency,base-lat=10,latency=5 -numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-bandwidth,base-bw=20,bandwidth=5 -numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-latency,base-lat=10,latency=15 -numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-bandwidth,base-bw=20,bandwidth=15 -numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-latency,base-lat=10,latency=20 -numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-bandwidth,base-bw=20,bandwidth=20 Result: # daxctl list -RDu [ { "path":"\/platform\/hmem.1", "id":1, "size":"4.00 GiB (4.29 GB)", "align":2097152, "devices":[ { "chardev":"dax1.0", "size":"4.00 GiB (4.29 GB)" } ] }, { "path":"\/platform\/hmem.0", "id":0, "size":"4.00 GiB (4.29 GB)", "align":2097152, "devices":[ { "chardev":"dax0.0", "size":"4.00 GiB (4.29 GB)" } ] } ] # cat /proc/iomem [..] 24000-43fff : Soft Reserved 24000-33fff : hmem.0 24000-33fff : dax0.0 34000-43fff : hmem.1 34000-43fff : dax1.0 Cc: Len Brown Cc: Keith Busch Cc: "Rafael J. Wysocki" Cc: Vishal Verma Cc: Jonathan Cameron Reviewed-by: Dave Hansen Signed-off-by: Dan Williams --- drivers/acpi/numa/Kconfig |1 drivers/acpi/numa/hmat.c | 136 + 2 files changed, 125 insertions(+), 12 deletions(-) diff --git a/drivers/acpi/numa/Kconfig b/drivers/acpi/numa/Kconfig index d14582387ed0..c1be746e111a 100644 --- a/drivers/acpi/numa/Kconfig +++ b/drivers/acpi/numa/Kconfig @@ -8,6 +8,7 @@ config ACPI_HMAT bool "ACPI Heterogeneous Memory Attribute Table Support" depends on ACPI_NUMA select HMEM_REPORTING + select MEMREGION help If set, this option has the kernel parse and report the platform's ACPI HMAT (Heterogeneous Memory Attributes Table), diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c index 4707eb9dd07b..eaa5a0f93dec 100644 --- a/drivers/acpi/numa/hmat.c +++ b/drivers/acpi/numa/hmat.c @@ -8,12 +8,18 @@ * the applicable attributes with the node's interfaces. */ +#define pr_fmt(fmt) "acpi/hmat: " fmt +#define dev_fmt(fmt) "acpi/hmat: " fmt + #include #include
[PATCH v5 00/10] EFI Specific Purpose Memory Support
Changes since v4 [1]: - Rename the facility from "Application Reserved" to "Soft Reserved" to better reflect how the memory is treated. While the spec talks about "specific / application purpose" memory the expected kernel behavior is to make a best effort at reserving the memory from general purpose allocations. - Add a new efi=nosoftreserve option to disable consideration of the EFI_MEMORY_SP attribute at boot time. This is also motivated by Christoph's initial feedback of allowing the kernel to opt-out of the policy whims of the platform BIOS implementation. - Update the KASLR implementation to exclude soft-reserved memory including the case where soft-reserved memory is specified via the efi_fake_mem= attribute-override command-line option. - Move the memregion allocator to its own object file. v4 had it in kernel/resource.c which caused compile errors on Sparc. I otherwise could not find an appropriate place to stash it. - Rebase on a merge of tip/master and rafael/linux-next since the series collides with changes in both those trees. [1]: https://lore.kernel.org/r/156140036490.2951909.1837804994781523185.st...@dwillia2-desk3.amr.corp.intel.com/ --- Thomas, Rafael, This happens to collide with both your trees. I think the content warrants going through the x86 tree, but would need to publish commit: 5c7ed4385424 HMAT: Skip publishing target info for nodes with no online memory ...in Rafael's tree as a stable id for -tip to pull in, but I'm also open to other options. I've retained Dave's reviewed-by from v4. --- The EFI 2.8 Specification [2] introduces the EFI_MEMORY_SP ("specific purpose") memory attribute. This attribute bit replaces the deprecated ACPI HMAT "reservation hint" that was introduced in ACPI 6.2 and removed in ACPI 6.3. Given the increasing diversity of memory types that might be advertised to the operating system, there is a need for platform firmware to hint which memory ranges are free for the OS to use as general purpose memory and which ranges are intended for application specific usage. For example, an application with prior knowledge of the platform may expect to be able to exclusively allocate a precious / limited pool of high bandwidth memory. Alternatively, for the general purpose case, the operating system may want to make the memory available on a best effort basis as a unique numa-node with performance properties by the new CONFIG_HMEM_REPORTING [3] facility. In support of optionally allowing either application-exclusive and core-kernel-mm managed access to differentiated memory, claim EFI_MEMORY_SP ranges for exposure as "soft reserved" and assigned to a device-dax instance by default. Such instances can be directly owned / mapped by a platform-topology-aware application. Alternatively, with the new kmem facility [4], the administrator has the option to instead designate that those memory ranges be hot-added to the core-kernel-mm as a unique memory numa-node. In short, allow for the decision about what software agent manages soft-reserved memory to be made at runtime. The patches build on the new HMAT+HMEM_REPORTING facilities merged for v5.2-rc1. The implementation is tested with qemu emulation of HMAT [5] plus the efi_fake_mem facility for applying the EFI_MEMORY_SP attribute. Specific details on reproducing the test configuration are in patch 10. [2]: https://uefi.org/sites/default/files/resources/UEFI_Spec_2_8_final.pdf [3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e1cf33aafb84 [4]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308f [5]: http://patchwork.ozlabs.org/cover/1096737/ --- Dan Williams (10): acpi/numa: Establish a new drivers/acpi/numa/ directory efi: Enumerate EFI_MEMORY_SP x86, efi: Push EFI_MEMMAP check into leaf routines x86, efi: Reserve UEFI 2.8 Specific Purpose Memory for dax x86, efi: Add efi_fake_mem support for EFI_MEMORY_SP lib: Uplevel the pmem "region" ida to a global allocator dax: Fix alloc_dax_region() compile warning device-dax: Add a driver for "hmem" devices acpi/numa/hmat: Register HMAT at device_initcall level acpi/numa/hmat: Register "soft reserved" memory as an "hmem" device Documentation/admin-guide/kernel-parameters.txt | 19 +++ arch/x86/Kconfig| 21 arch/x86/boot/compressed/eboot.c|7 + arch/x86/boot/compressed/kaslr.c| 50 +++- arch/x86/include/asm/e820/types.h |8 + arch/x86/include/asm/efi-stub.h | 11 ++ arch/x86/include/asm/efi.h | 17 +++ arch/x86/kernel/e820.c | 12 ++ arch/x86/kernel/setup.c | 19 ++- arch/x86/platform/efi/efi.c | 56 + arch/x86/platform/efi/quirks.c |3 + drivers/acpi/Kconfig
[PATCH v5 06/10] lib: Uplevel the pmem "region" ida to a global allocator
In preparation for handling platform differentiated memory types beyond persistent memory, uplevel the "region" identifier to a global number space. This enables a device-dax instance to be registered to any memory type with guaranteed unique names. Cc: Keith Busch Signed-off-by: Dan Williams --- drivers/nvdimm/Kconfig |1 + drivers/nvdimm/core.c|1 - drivers/nvdimm/nd-core.h |1 - drivers/nvdimm/region_devs.c | 13 - include/linux/memregion.h| 19 +++ lib/Kconfig |3 +++ lib/Makefile |1 + lib/memregion.c | 18 ++ 8 files changed, 46 insertions(+), 11 deletions(-) create mode 100644 include/linux/memregion.h create mode 100644 lib/memregion.c diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig index a5fde15e91d3..f6e7bcb3f9a5 100644 --- a/drivers/nvdimm/Kconfig +++ b/drivers/nvdimm/Kconfig @@ -4,6 +4,7 @@ menuconfig LIBNVDIMM depends on PHYS_ADDR_T_64BIT depends on HAS_IOMEM depends on BLK_DEV + select MEMREGION help Generic support for non-volatile memory devices including ACPI-6-NFIT defined resources. On platforms that define an diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c index 9204f1e9fd14..e592c4964674 100644 --- a/drivers/nvdimm/core.c +++ b/drivers/nvdimm/core.c @@ -455,7 +455,6 @@ static __exit void libnvdimm_exit(void) nd_region_exit(); nvdimm_exit(); nvdimm_bus_exit(); - nd_region_devs_exit(); nvdimm_devs_exit(); } diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h index 0ac52b6eb00e..8408412fba1b 100644 --- a/drivers/nvdimm/nd-core.h +++ b/drivers/nvdimm/nd-core.h @@ -127,7 +127,6 @@ struct nvdimm_bus *walk_to_nvdimm_bus(struct device *nd_dev); int __init nvdimm_bus_init(void); void nvdimm_bus_exit(void); void nvdimm_devs_exit(void); -void nd_region_devs_exit(void); void nd_region_probe_success(struct nvdimm_bus *nvdimm_bus, struct device *dev); struct nd_region; void nd_region_create_ns_seed(struct nd_region *nd_region); diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c index af30cbe7a8ea..5a152a897c94 100644 --- a/drivers/nvdimm/region_devs.c +++ b/drivers/nvdimm/region_devs.c @@ -3,6 +3,7 @@ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. */ #include +#include #include #include #include @@ -19,7 +20,6 @@ */ #include -static DEFINE_IDA(region_ida); static DEFINE_PER_CPU(int, flush_idx); static int nvdimm_map_flush(struct device *dev, struct nvdimm *nvdimm, int dimm, @@ -133,7 +133,7 @@ static void nd_region_release(struct device *dev) put_device(>dev); } free_percpu(nd_region->lane); - ida_simple_remove(_ida, nd_region->id); + memregion_free(nd_region->id); if (is_nd_blk(dev)) kfree(to_nd_blk_region(dev)); else @@ -1034,7 +1034,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus, if (!region_buf) return NULL; - nd_region->id = ida_simple_get(_ida, 0, 0, GFP_KERNEL); + nd_region->id = memregion_alloc(GFP_KERNEL); if (nd_region->id < 0) goto err_id; @@ -1093,7 +1093,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus, return nd_region; err_percpu: - ida_simple_remove(_ida, nd_region->id); + memregion_free(nd_region->id); err_id: kfree(region_buf); return NULL; @@ -1262,8 +1262,3 @@ int nd_region_conflict(struct nd_region *nd_region, resource_size_t start, return device_for_each_child(_bus->dev, , region_conflict); } - -void __exit nd_region_devs_exit(void) -{ - ida_destroy(_ida); -} diff --git a/include/linux/memregion.h b/include/linux/memregion.h new file mode 100644 index ..07ecde0bd136 --- /dev/null +++ b/include/linux/memregion.h @@ -0,0 +1,19 @@ +// SPDX-License-Identifier: GPL-2.0 +#ifndef _MEMREGION_H_ +#define _MEMREGION_H_ +#include +#include + +#ifdef CONFIG_MEMREGION +int memregion_alloc(gfp_t gfp); +void memregion_free(int id); +#else +static inline int memregion_alloc(gfp_t gfp) +{ + return -ENOMEM; +} +void memregion_free(int id) +{ +} +#endif +#endif /* _MEMREGION_H_ */ diff --git a/lib/Kconfig b/lib/Kconfig index f33d66fc0e86..9fce7e15716a 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -607,6 +607,9 @@ config ARCH_NO_SG_CHAIN config ARCH_HAS_PMEM_API bool +config MEMREGION + bool + # use memcpy to implement user copies for nommu architectures config UACCESS_MEMCPY bool diff --git a/lib/Makefile b/lib/Makefile index c5892807e06f..2fb7b47018f1 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -212,6 +212,7 @@ obj-$(CONFIG_GENERIC_NET_UTILS) += net_utils.o obj-$(CONFIG_SG_SPLIT) += sg_split.o obj-$(CONFIG_SG_POOL) += sg_pool.o
[PATCH v5 05/10] x86, efi: Add efi_fake_mem support for EFI_MEMORY_SP
Given that EFI_MEMORY_SP is platform BIOS policy descision for marking memory ranges as "reserved for a specific purpose" there will inevitably be scenarios where the BIOS omits the attribute in situations where it is desired. Unlike other attributes if the OS wants to reserve this memory from the kernel the reservation needs to happen early in init. So early, in fact, that it needs to happen before e820__memblock_setup() which is a pre-requisite for efi_fake_memmap() that wants to allocate memory for the updated table. Introduce an x86 specific efi_fake_memmap_early() that can search for attempts to set EFI_MEMORY_SP via efi_fake_mem and update the e820 table accordingly. The KASLR code that scans the command line looking for user-directed memory reservations also needs to be updated to consider "efi_fake_mem=nn@ss:0x4" requests. Cc: Cc: Borislav Petkov Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Thomas Gleixner Cc: Ard Biesheuvel Reviewed-by: Dave Hansen Signed-off-by: Dan Williams --- arch/x86/boot/compressed/kaslr.c| 46 --- arch/x86/include/asm/efi.h |8 arch/x86/platform/efi/efi.c |2 + drivers/firmware/efi/Makefile |5 ++- drivers/firmware/efi/fake_mem.c | 24 ++-- drivers/firmware/efi/fake_mem.h | 10 + drivers/firmware/efi/x86-fake_mem.c | 69 +++ 7 files changed, 143 insertions(+), 21 deletions(-) create mode 100644 drivers/firmware/efi/fake_mem.h create mode 100644 drivers/firmware/efi/x86-fake_mem.c diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c index 093e84e28b7a..53ed3991f9a8 100644 --- a/arch/x86/boot/compressed/kaslr.c +++ b/arch/x86/boot/compressed/kaslr.c @@ -133,8 +133,14 @@ char *skip_spaces(const char *str) #include "../../../../lib/ctype.c" #include "../../../../lib/cmdline.c" +enum parse_mode { + PARSE_MEMMAP, + PARSE_EFI, +}; + static int -parse_memmap(char *p, unsigned long long *start, unsigned long long *size) +parse_memmap(char *p, unsigned long long *start, unsigned long long *size, + enum parse_mode mode) { char *oldp; @@ -157,8 +163,33 @@ parse_memmap(char *p, unsigned long long *start, unsigned long long *size) *start = memparse(p + 1, ); return 0; case '@': - /* memmap=nn@ss specifies usable region, should be skipped */ - *size = 0; + if (mode == PARSE_MEMMAP) { + /* +* memmap=nn@ss specifies usable region, should +* be skipped +*/ + *size = 0; + } else { + unsigned long long flags; + + /* +* efi_fake_mem=nn@ss:attr the attr specifies +* flags that might imply a soft-reservation. +*/ + *start = memparse(p + 1, ); + if (p && *p == ':') { + p++; + oldp = p; + flags = simple_strtoull(p, , 0); + if (p == oldp) + *size = 0; + else if (flags & EFI_MEMORY_SP) + return 0; + else + *size = 0; + } else + *size = 0; + } /* Fall through */ default: /* @@ -173,7 +204,7 @@ parse_memmap(char *p, unsigned long long *start, unsigned long long *size) return -EINVAL; } -static void mem_avoid_memmap(char *str) +static void mem_avoid_memmap(enum parse_mode mode, char *str) { static int i; @@ -188,7 +219,7 @@ static void mem_avoid_memmap(char *str) if (k) *k++ = 0; - rc = parse_memmap(str, , ); + rc = parse_memmap(str, , , mode); if (rc < 0) break; str = k; @@ -239,7 +270,6 @@ static void parse_gb_huge_pages(char *param, char *val) } } - static void handle_mem_options(void) { char *args = (char *)get_cmd_line_ptr(); @@ -272,7 +302,7 @@ static void handle_mem_options(void) } if (!strcmp(param, "memmap")) { - mem_avoid_memmap(val); + mem_avoid_memmap(PARSE_MEMMAP, val); } else if (strstr(param, "hugepages")) { parse_gb_huge_pages(param, val); } else if (!strcmp(param, "mem")) { @@ -285,6 +315,8 @@ static void handle_mem_options(void) goto out; mem_limit = mem_size; +
[PATCH v5 03/10] x86, efi: Push EFI_MEMMAP check into leaf routines
In preparation for adding another EFI_MEMMAP dependent call that needs to occur before e820__memblock_setup() fixup the existing efi calls to check for EFI_MEMMAP internally. This ends up being cleaner than the alternative of checking EFI_MEMMAP multiple times in setup_arch(). Cc: Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Andy Lutomirski Cc: Thomas Gleixner Cc: Peter Zijlstra Reviewed-by: Dave Hansen Signed-off-by: Dan Williams --- arch/x86/include/asm/efi.h |9 - arch/x86/kernel/setup.c | 19 +-- arch/x86/platform/efi/efi.c |3 +++ arch/x86/platform/efi/quirks.c |3 +++ drivers/firmware/efi/esrt.c |3 +++ drivers/firmware/efi/fake_mem.c |2 +- include/linux/efi.h |2 -- 7 files changed, 27 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h index 43a82e59c59d..45f853bce869 100644 --- a/arch/x86/include/asm/efi.h +++ b/arch/x86/include/asm/efi.h @@ -140,7 +140,6 @@ extern void efi_delete_dummy_variable(void); extern void efi_switch_mm(struct mm_struct *mm); extern void efi_recover_from_page_fault(unsigned long phys_addr); extern void efi_free_boot_services(void); -extern void efi_reserve_boot_services(void); struct efi_setup_data { u64 fw_vendor; @@ -244,6 +243,8 @@ static inline bool efi_is_64bit(void) extern bool efi_reboot_required(void); extern bool efi_is_table_address(unsigned long phys_addr); +extern void efi_find_mirror(void); +extern void efi_reserve_boot_services(void); #else static inline void parse_efi_setup(u64 phys_addr, u32 data_len) {} static inline bool efi_reboot_required(void) @@ -254,6 +255,12 @@ static inline bool efi_is_table_address(unsigned long phys_addr) { return false; } +static inline void efi_find_mirror(void) +{ +} +static inline void efi_reserve_boot_services(void) +{ +} #endif /* CONFIG_EFI */ #endif /* _ASM_X86_EFI_H */ diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index bbe35bf879f5..9bfecb542440 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -1118,21 +1118,20 @@ void __init setup_arch(char **cmdline_p) cleanup_highmap(); memblock_set_current_limit(ISA_END_ADDRESS); + e820__memblock_setup(); reserve_bios_regions(); - if (efi_enabled(EFI_MEMMAP)) { - efi_fake_memmap(); - efi_find_mirror(); - efi_esrt_init(); + efi_fake_memmap(); + efi_find_mirror(); + efi_esrt_init(); - /* -* The EFI specification says that boot service code won't be -* called after ExitBootServices(). This is, in fact, a lie. -*/ - efi_reserve_boot_services(); - } + /* +* The EFI specification says that boot service code won't be +* called after ExitBootServices(). This is, in fact, a lie. +*/ + efi_reserve_boot_services(); /* preallocate 4k for mptable mpc */ e820__memblock_alloc_reserved_mpc_new(); diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c index c202e1b07e29..0bb58eb33ca0 100644 --- a/arch/x86/platform/efi/efi.c +++ b/arch/x86/platform/efi/efi.c @@ -128,6 +128,9 @@ void __init efi_find_mirror(void) efi_memory_desc_t *md; u64 mirror_size = 0, total_size = 0; + if (!efi_enabled(EFI_MEMMAP)) + return; + for_each_efi_memory_desc(md) { unsigned long long start = md->phys_addr; unsigned long long size = md->num_pages << EFI_PAGE_SHIFT; diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c index 3b9fd679cea9..7675cf754d90 100644 --- a/arch/x86/platform/efi/quirks.c +++ b/arch/x86/platform/efi/quirks.c @@ -320,6 +320,9 @@ void __init efi_reserve_boot_services(void) { efi_memory_desc_t *md; + if (!efi_enabled(EFI_MEMMAP)) + return; + for_each_efi_memory_desc(md) { u64 start = md->phys_addr; u64 size = md->num_pages << EFI_PAGE_SHIFT; diff --git a/drivers/firmware/efi/esrt.c b/drivers/firmware/efi/esrt.c index d6dd5f503fa2..2762e0662bf4 100644 --- a/drivers/firmware/efi/esrt.c +++ b/drivers/firmware/efi/esrt.c @@ -246,6 +246,9 @@ void __init efi_esrt_init(void) int rc; phys_addr_t end; + if (!efi_enabled(EFI_MEMMAP)) + return; + pr_debug("esrt-init: loading.\n"); if (!esrt_table_exists()) return; diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c index 9501edc0fcfb..526b45331d96 100644 --- a/drivers/firmware/efi/fake_mem.c +++ b/drivers/firmware/efi/fake_mem.c @@ -44,7 +44,7 @@ void __init efi_fake_memmap(void) void *new_memmap; int i; - if (!nr_fake_mem) + if (!efi_enabled(EFI_MEMMAP) || !nr_fake_mem) return; /*
[PATCH v5 02/10] efi: Enumerate EFI_MEMORY_SP
UEFI 2.8 defines an EFI_MEMORY_SP attribute bit to augment the interpretation of the EFI Memory Types as "reserved for a specific purpose". The intent of this bit is to allow the OS to identify precious or scarce memory resources and optionally manage it separately from EfiConventionalMemory. As defined older OSes that do not know about this attribute are permitted to ignore it and the memory will be handled according to the OS default policy for the given memory type. In other words, this "specific purpose" hint is deliberately weaker than EfiReservedMemoryType in that the system continues to operate if the OS takes no action on the attribute. The risk of taking no action is potentially unwanted / unmovable kernel allocations from the designated resource that prevent the full realization of the "specific purpose". For example, consider a system with a high-bandwidth memory pool. Older kernels are permitted to boot and consume that memory as conventional "System-RAM" newer kernels may arrange for that memory to be set aside (soft reserved) by the system administrator for a dedicated high-bandwidth memory aware application to consume. Specifically, this mechanism allows for the elimination of scenarios where platform firmware tries to game OS policy by lying about ACPI SLIT values, i.e. claiming that a precious memory resource has a high distance to trigger the OS to avoid it by default. This reservation hint allows platform-firmware to instead tell the truth about performance characteristics by indicate to OS memory management to put immovable allocations elsewhere. Implement simple detection of the bit for EFI memory table dumps and save the kernel policy for a follow-on change. Reviewed-by: Ard Biesheuvel Reviewed-by: Dave Hansen Signed-off-by: Dan Williams --- drivers/firmware/efi/efi.c |5 +++-- include/linux/efi.h|1 + 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index 8f1ab04f6743..363bb9d00fa5 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -833,15 +833,16 @@ char * __init efi_md_typeattr_format(char *buf, size_t size, if (attr & ~(EFI_MEMORY_UC | EFI_MEMORY_WC | EFI_MEMORY_WT | EFI_MEMORY_WB | EFI_MEMORY_UCE | EFI_MEMORY_RO | EFI_MEMORY_WP | EFI_MEMORY_RP | EFI_MEMORY_XP | -EFI_MEMORY_NV | +EFI_MEMORY_NV | EFI_MEMORY_SP | EFI_MEMORY_RUNTIME | EFI_MEMORY_MORE_RELIABLE)) snprintf(pos, size, "|attr=0x%016llx]", (unsigned long long)attr); else snprintf(pos, size, -"|%3s|%2s|%2s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]", + "|%3s|%2s|%2s|%2s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]", attr & EFI_MEMORY_RUNTIME ? "RUN" : "", attr & EFI_MEMORY_MORE_RELIABLE ? "MR" : "", +attr & EFI_MEMORY_SP ? "SP" : "", attr & EFI_MEMORY_NV ? "NV" : "", attr & EFI_MEMORY_XP ? "XP" : "", attr & EFI_MEMORY_RP ? "RP" : "", diff --git a/include/linux/efi.h b/include/linux/efi.h index bd3837022307..5c1dd0221384 100644 --- a/include/linux/efi.h +++ b/include/linux/efi.h @@ -112,6 +112,7 @@ typedef struct { #define EFI_MEMORY_MORE_RELIABLE \ ((u64)0x0001ULL)/* higher reliability */ #define EFI_MEMORY_RO ((u64)0x0002ULL)/* read-only */ +#define EFI_MEMORY_SP ((u64)0x0004ULL)/* soft reserved */ #define EFI_MEMORY_RUNTIME ((u64)0x8000ULL)/* range requires runtime mapping */ #define EFI_MEMORY_DESCRIPTOR_VERSION 1
[PATCH v5 01/10] acpi/numa: Establish a new drivers/acpi/numa/ directory
Currently hmat.c lives under an "hmat" directory which does not enhance the description of the file. The initial motivation for giving hmat.c its own directory was to delineate it as mm functionality in contrast to ACPI device driver functionality. As ACPI continues to play an increasing role in conveying memory location and performance topology information to the OS take the opportunity to co-locate these NUMA relevant tables in a combined directory. numa.c is renamed to srat.c and moved to drivers/acpi/numa/ along with hmat.c. Cc: Len Brown Cc: Keith Busch Cc: "Rafael J. Wysocki" Reviewed-by: Dave Hansen Signed-off-by: Dan Williams --- drivers/acpi/Kconfig |9 + drivers/acpi/Makefile |3 +-- drivers/acpi/hmat/Makefile |2 -- drivers/acpi/numa/Kconfig |7 ++- drivers/acpi/numa/Makefile |3 +++ drivers/acpi/numa/hmat.c |0 drivers/acpi/numa/srat.c |0 7 files changed, 11 insertions(+), 13 deletions(-) delete mode 100644 drivers/acpi/hmat/Makefile rename drivers/acpi/{hmat/Kconfig => numa/Kconfig} (72%) create mode 100644 drivers/acpi/numa/Makefile rename drivers/acpi/{hmat/hmat.c => numa/hmat.c} (100%) rename drivers/acpi/{numa.c => numa/srat.c} (100%) diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index 5f6158973289..8c7c46065e9d 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -319,12 +319,6 @@ config ACPI_THERMAL To compile this driver as a module, choose M here: the module will be called thermal. -config ACPI_NUMA - bool "NUMA support" - depends on NUMA - depends on (X86 || IA64 || ARM64) - default y if IA64_GENERIC || IA64_SGI_SN2 || ARM64 - config ACPI_CUSTOM_DSDT_FILE string "Custom DSDT Table file to include" default "" @@ -473,8 +467,7 @@ config ACPI_REDUCED_HARDWARE_ONLY If you are unsure what to do, do not enable this option. source "drivers/acpi/nfit/Kconfig" -source "drivers/acpi/hmat/Kconfig" - +source "drivers/acpi/numa/Kconfig" source "drivers/acpi/apei/Kconfig" source "drivers/acpi/dptf/Kconfig" diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index 5d361e4e3405..f08a661274e8 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -55,7 +55,6 @@ acpi-$(CONFIG_X86)+= acpi_cmos_rtc.o acpi-$(CONFIG_X86) += x86/apple.o acpi-$(CONFIG_X86) += x86/utils.o acpi-$(CONFIG_DEBUG_FS)+= debugfs.o -acpi-$(CONFIG_ACPI_NUMA) += numa.o acpi-$(CONFIG_ACPI_PROCFS_POWER) += cm_sbs.o acpi-y += acpi_lpat.o acpi-$(CONFIG_ACPI_LPIT) += acpi_lpit.o @@ -80,7 +79,7 @@ obj-$(CONFIG_ACPI_PROCESSOR) += processor.o obj-$(CONFIG_ACPI) += container.o obj-$(CONFIG_ACPI_THERMAL) += thermal.o obj-$(CONFIG_ACPI_NFIT)+= nfit/ -obj-$(CONFIG_ACPI_HMAT)+= hmat/ +obj-$(CONFIG_ACPI_NUMA)+= numa/ obj-$(CONFIG_ACPI) += acpi_memhotplug.o obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o obj-$(CONFIG_ACPI_BATTERY) += battery.o diff --git a/drivers/acpi/hmat/Makefile b/drivers/acpi/hmat/Makefile deleted file mode 100644 index 1c20ef36a385.. --- a/drivers/acpi/hmat/Makefile +++ /dev/null @@ -1,2 +0,0 @@ -# SPDX-License-Identifier: GPL-2.0-only -obj-$(CONFIG_ACPI_HMAT) := hmat.o diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/numa/Kconfig similarity index 72% rename from drivers/acpi/hmat/Kconfig rename to drivers/acpi/numa/Kconfig index 95a29964dbea..d14582387ed0 100644 --- a/drivers/acpi/hmat/Kconfig +++ b/drivers/acpi/numa/Kconfig @@ -1,4 +1,9 @@ -# SPDX-License-Identifier: GPL-2.0 +config ACPI_NUMA + bool "NUMA support" + depends on NUMA + depends on (X86 || IA64 || ARM64) + default y if IA64_GENERIC || IA64_SGI_SN2 || ARM64 + config ACPI_HMAT bool "ACPI Heterogeneous Memory Attribute Table Support" depends on ACPI_NUMA diff --git a/drivers/acpi/numa/Makefile b/drivers/acpi/numa/Makefile new file mode 100644 index ..517a6c689a94 --- /dev/null +++ b/drivers/acpi/numa/Makefile @@ -0,0 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-$(CONFIG_ACPI_NUMA) += srat.o +obj-$(CONFIG_ACPI_HMAT) += hmat.o diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/numa/hmat.c similarity index 100% rename from drivers/acpi/hmat/hmat.c rename to drivers/acpi/numa/hmat.c diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa/srat.c similarity index 100% rename from drivers/acpi/numa.c rename to drivers/acpi/numa/srat.c
Re: [PATCH 1/2] efi+tpm: Don't access event->count when it isn't mapped.
On Tue, Aug 27, 2019 at 06:11:58PM -0400, Peter Jones wrote: > On Tue, Aug 27, 2019 at 04:41:55PM +0300, Jarkko Sakkinen wrote: > > On Tue, Aug 27, 2019 at 02:03:44PM +0300, Jarkko Sakkinen wrote: > > > > Jarkko, these two should probably go to 5.3 if possible - I > > > > independently had a report of a system hitting this issue last week > > > > (Intel apparently put a surprising amount of data in the event logs on > > > > the NUCs). > > > > > > OK, I can try to push them. I'll do PR today. > > > > Ard, how do you wish these to be managed? > > > > I'm asking this because: > > > > 1. Neither patch was CC'd to linux-integrity. > > 2. Neither patch has your tags ATM. > > I think Ard's not back until September. I can just to re-send them with > the accumulated tags and Cc linux-integrity, if you think that would > help? I take the risk. If possible, add all the cumulated tags to those patches... /Jarkko