Re: [PATCH] drivers: nvdimm: fix memleak
Konstantin Meskhidze writes: > Memory pointed by 'nd_pmu->pmu.attr_groups' is allocated in function > 'register_nvdimm_pmu' and is lost after 'kfree(nd_pmu)' call in function > 'unregister_nvdimm_pmu'. > > Co-developed-by: Ivanov Mikhail > Signed-off-by: Konstantin Meskhidze > --- > drivers/nvdimm/nd_perf.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/nvdimm/nd_perf.c b/drivers/nvdimm/nd_perf.c > index 433bbb68a..14881c4e0 100644 > --- a/drivers/nvdimm/nd_perf.c > +++ b/drivers/nvdimm/nd_perf.c > @@ -323,7 +323,8 @@ EXPORT_SYMBOL_GPL(register_nvdimm_pmu); > void unregister_nvdimm_pmu(struct nvdimm_pmu *nd_pmu) > { > perf_pmu_unregister(&nd_pmu->pmu); > nvdimm_pmu_free_hotplug_memory(nd_pmu); > + kfree(nd_pmu->pmu.attr_groups); > kfree(nd_pmu); > } > EXPORT_SYMBOL_GPL(unregister_nvdimm_pmu); Reviewed-by: Jeff Moyer
Re: [PATCH] drivers: nvdimm: fix dereference after free
Konstantin Meskhidze writes: > 'nd_pmu->pmu.attr_groups' is dereferenced in function > 'nvdimm_pmu_free_hotplug_memory' call after it has been freed. Because in > function 'nvdimm_pmu_free_hotplug_memory' memory pointed by the fields of > 'nd_pmu->pmu.attr_groups' is deallocated it is necessary to call 'kfree' > after 'nvdimm_pmu_free_hotplug_memory'. > > Co-developed-by: Ivanov Mikhail > Signed-off-by: Konstantin Meskhidze > --- > drivers/nvdimm/nd_perf.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/nvdimm/nd_perf.c b/drivers/nvdimm/nd_perf.c > index 14881c4e0..2b6dc80d8 100644 > --- a/drivers/nvdimm/nd_perf.c > +++ b/drivers/nvdimm/nd_perf.c > @@ -307,10 +307,10 @@ int register_nvdimm_pmu(struct nvdimm_pmu *nd_pmu, > struct platform_device *pdev) > } > > rc = perf_pmu_register(&nd_pmu->pmu, nd_pmu->pmu.name, -1); > if (rc) { > - kfree(nd_pmu->pmu.attr_groups); > nvdimm_pmu_free_hotplug_memory(nd_pmu); > + kfree(nd_pmu->pmu.attr_groups); > return rc; > } > > pr_info("%s NVDIMM performance monitor support registered\n", Reviewed-by: Jeff Moyer
Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
David Hildenbrand writes: > On 13.07.23 21:12, Jeff Moyer wrote: >> David Hildenbrand writes: >> >>> On 16.06.23 00:00, Vishal Verma wrote: >>>> The dax/kmem driver can potentially hot-add large amounts of memory >>>> originating from CXL memory expanders, or NVDIMMs, or other 'device >>>> memories'. There is a chance there isn't enough regular system memory >>>> available to fit ythe memmap for this new memory. It's therefore >>>> desirable, if all other conditions are met, for the kmem managed memory >>>> to place its memmap on the newly added memory itself. >>>> >>>> Arrange for this by first allowing for a module parameter override for >>>> the mhp_supports_memmap_on_memory() test using a flag, adjusting the >>>> only other caller of this interface in dirvers/acpi/acpi_memoryhotplug.c, >>>> exporting the symbol so it can be called by kmem.c, and finally changing >>>> the kmem driver to add_memory() in chunks of memory_block_size_bytes(). >>> >>> 1) Why is the override a requirement here? Just let the admin >>> configure it then then add conditional support for kmem. >>> >>> 2) I recall that there are cases where we don't want the memmap to >>> land on slow memory (which online_movable would achieve). Just imagine >>> the slow PMEM case. So this might need another configuration knob on >>> the kmem side. >> >> From my memory, the case where you don't want the memmap to land on >> *persistent memory* is when the device is small (such as NVDIMM-N), and >> you want to reserve as much space as possible for the application data. >> This has nothing to do with the speed of access. > > Now that you mention it, I also do remember the origin of the altmap -- > to achieve exactly that: place the memmap on the device. > > commit 4b94ffdc4163bae1ec73b6e977ffb7a7da3d06d3 > Author: Dan Williams > Date: Fri Jan 15 16:56:22 2016 -0800 > > x86, mm: introduce vmem_altmap to augment vmemmap_populate() > In support of providing struct page for large persistent memory > capacities, use struct vmem_altmap to change the default policy for > allocating memory for the memmap array. The default vmemmap_populate() > allocates page table storage area from the page allocator. Given > persistent memory capacities relative to DRAM it may not be feasible to > store the memmap in 'System Memory'. Instead vmem_altmap represents > pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf() > requests. > > In PFN_MODE_PMEM (and only then), we use the altmap (don't see a way to > configure it). Configuration is done at pmem namespace creation time. The metadata for the namespace indicates where the memmap resides. See the ndctl-create-namespace man page: -M, --map= A pmem namespace in "fsdax" or "devdax" mode requires allocation of per-page metadata. The allocation can be drawn from either: · "mem": typical system memory · "dev": persistent memory reserved from the namespace Given relative capacities of "Persistent Memory" to "System RAM" the allocation defaults to reserving space out of the namespace directly ("--map=dev"). The overhead is 64-bytes per 4K (16GB per 1TB) on x86. > BUT that case is completely different from the "System RAM" mode. The memmap > of an NVDIMM in pmem mode is barely used by core-mm (i.e., not the buddy). Right. (btw, I don't think system ram mode existed back then.) > In comparison, if the buddy and everybody else works on the memmap in > "System RAM", it's much more significant if that resides on slow memory. Agreed. > Looking at > > commit 9b6e63cbf85b89b2dbffa4955dbf2df8250e5375 > Author: Michal Hocko > Date: Tue Oct 3 16:16:19 2017 -0700 > > mm, page_alloc: add scheduling point to memmap_init_zone > memmap_init_zone gets a pfn range to initialize and it can be > really > large resulting in a soft lockup on non-preemptible kernels > NMI watchdog: BUG: soft lockup - CPU#31 stuck for 23s! > [kworker/u642:5:1720] > [...] > task: 88ecd7e902c0 ti: 88eca4e5 task.ti: 88eca4e5 > RIP: move_pfn_range_to_zone+0x185/0x1d0 > [...] > Call Trace: > devm_memremap_pages+0x2c7/0x430 > pmem_attach_disk+0x2fd/0x3f0 [nd_pmem] > nvdimm_bus_probe+0x64/0x110 [libnvdimm] > > > It
Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
David Hildenbrand writes: > On 16.06.23 00:00, Vishal Verma wrote: >> The dax/kmem driver can potentially hot-add large amounts of memory >> originating from CXL memory expanders, or NVDIMMs, or other 'device >> memories'. There is a chance there isn't enough regular system memory >> available to fit ythe memmap for this new memory. It's therefore >> desirable, if all other conditions are met, for the kmem managed memory >> to place its memmap on the newly added memory itself. >> >> Arrange for this by first allowing for a module parameter override for >> the mhp_supports_memmap_on_memory() test using a flag, adjusting the >> only other caller of this interface in dirvers/acpi/acpi_memoryhotplug.c, >> exporting the symbol so it can be called by kmem.c, and finally changing >> the kmem driver to add_memory() in chunks of memory_block_size_bytes(). > > 1) Why is the override a requirement here? Just let the admin > configure it then then add conditional support for kmem. > > 2) I recall that there are cases where we don't want the memmap to > land on slow memory (which online_movable would achieve). Just imagine > the slow PMEM case. So this might need another configuration knob on > the kmem side. >From my memory, the case where you don't want the memmap to land on *persistent memory* is when the device is small (such as NVDIMM-N), and you want to reserve as much space as possible for the application data. This has nothing to do with the speed of access. -Jeff
Re: [PATCH v2] libnvdimm/region: Allow setting align attribute on regions without mappings
Tyler Hicks writes: > The alignment constraint for namespace creation in a region was > increased, from 2M to 16M, for non-PowerPC architectures in v5.7 with > commit 2522afb86a8c ("libnvdimm/region: Introduce an 'align' > attribute"). The thought behind the change was that region alignment > should be uniform across all architectures and, since PowerPC had the > largest alignment constraint of 16M, all architectures should conform to > that alignment. > > The change regressed namespace creation in pre-defined regions that > relied on 2M alignment but a workaround was provided in the form of a > sysfs attribute, named 'align', that could be adjusted to a non-default > alignment value. > > However, the sysfs attribute's store function returned an error (-ENXIO) > when userspace attempted to change the alignment of a region that had no > mappings. This affected 2M aligned regions of volatile memory that were > defined in a device tree using "pmem-region" and created by the > of_pmem_region_driver, since those regions do not contain mappings > (ndr_mappings is 0). > > Allow userspace to set the align attribute on pre-existing regions that > do not have mappings so that namespaces can still be within those > regions, despite not being aligned to 16M. > > Link: > https://lore.kernel.org/lkml/CA+CK2bDJ3hrWoE91L2wpAk+Yu0_=GtYw=4glddd7mxs321b...@mail.gmail.com > Fixes: 2522afb86a8c ("libnvdimm/region: Introduce an 'align' attribute") > Signed-off-by: Tyler Hicks > --- > > While testing with a recent kernel release (6.0-rc3), I rediscovered > this bug and eventually realized that I never followed through with > fixing it upstream. After a year later, here's the v2 that Aneesh > requested. Sorry about that! > > v2: > - Included Aneesh's feedback to ensure the val is a power of 2 and > greater than PAGE_SIZE even for regions without mappings > - Reused the max_t() trick from default_align() to avoid special > casing, with an if-else, when regions have mappings and when they > don't > + Didn't include Pavel's Reviewed-by since this is a slightly > different approach than what he reviewed in v1 > - Added a Link commit tag to Pavel's initial problem description > v1: > https://lore.kernel.org/lkml/20210326152645.85225-1-tyhi...@linux.microsoft.com/ > > drivers/nvdimm/region_devs.c | 8 +++- > 1 file changed, 3 insertions(+), 5 deletions(-) > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c > index 473a71bbd9c9..550ea0bd6c53 100644 > --- a/drivers/nvdimm/region_devs.c > +++ b/drivers/nvdimm/region_devs.c > @@ -509,16 +509,13 @@ static ssize_t align_store(struct device *dev, > { > struct nd_region *nd_region = to_nd_region(dev); > unsigned long val, dpa; > - u32 remainder; > + u32 mappings, remainder; > int rc; > > rc = kstrtoul(buf, 0, &val); > if (rc) > return rc; > > - if (!nd_region->ndr_mappings) > - return -ENXIO; > - > /* >* Ensure space-align is evenly divisible by the region >* interleave-width because the kernel typically has no facility > @@ -526,7 +523,8 @@ static ssize_t align_store(struct device *dev, >* contribute to the tail capacity in system-physical-address >* space for the namespace. >*/ > - dpa = div_u64_rem(val, nd_region->ndr_mappings, &remainder); > + mappings = max_t(u32, 1, nd_region->ndr_mappings); > + dpa = div_u64_rem(val, mappings, &remainder); > if (!is_power_of_2(dpa) || dpa < PAGE_SIZE > || val > region_size(nd_region) || remainder) > return -EINVAL; The math all looks okay, and this matches what's done in default_align. Unfortunately, I don't know enough about the power architecture to understand how you can have a region with no dimms (ndr_mappings == 0). -Jeff
Re: [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem
"Li, Zhijian" writes: > Copied to nvdimm list > > Thanks > > Zhijian > > > on 2022/1/6 14:12, Li Zhijian wrote: >> >> Add Dan to the party :) >> >> May i know whether there is any existing APIs to check whether >> a va/page backs to a nvdimm/pmem ? I don't know of one. You could try walk_system_ram_range looking for IORES_DESC_PERSISTENT_MEMORY, but that's not very efficient. >>> You need to get Dan to check this out, but I'm pretty sure this should >>> be more like this: >>> >>> if (is_zone_device_page(page) && page->pgmap->type == >>> MEMORY_DEVICE_FS_DAX) You forgot MEMORY_DEVICE_GENERIC. However, this doesn't guarantee the memory belongs to persistent memory, only that it is direct access capable. Dan, any ideas? -Jeff
Re: [block] 52f019d43c: ndctl.test-libndctl.fail
Christoph Hellwig writes: > Dan, > > can you make any sense of thos report? > > name='nfit_test' > path='/lib/modules/5.11.0-rc5-3-g52f019d43c22/extra/test/nfit_test.ko' >> check_set_config_data: dimm: 0 read2 data miscompare: 0 >> check_set_config_data: dimm: 0x1 read2 data miscompare: 0 >> check_set_config_data: dimm: 0x100 read2 data miscompare: 0 >> check_set_config_data: dimm: 0x101 read2 data miscompare: 0 >> check_dax_autodetect: dax_ndns: 0x558a74d92f00 ndns: 0x558a74d92f00 >> check_dax_autodetect: dax_ndns: 0x558a74d91f40 ndns: 0x558a74d91f40 >> check_pfn_autodetect: pfn_ndns: 0x558a74d91f40 ndns: 0x558a74d91f40 >> check_pfn_autodetect: pfn_ndns: 0x558a74d8c5e0 ndns: 0x558a74d8c5e0 >> check_btt_autodetect: btt_ndns: 0x558a74d8c5e0 ndns: 0x558a74d8c5e0 >> check_btt_autodetect: btt_ndns: 0x558a74da1390 ndns: 0x558a74da1390 >> check_btt_autodetect: btt_ndns: 0x558a74d8c5e0 ndns: 0x558a74d8c5e0 >> check_btt_autodetect: btt_ndns: 0x558a74d91f40 ndns: 0x558a74d91f40 >> namespace7.0: failed to write /dev/pmem7 >> check_namespaces: namespace7.0 validate_bdev failed >> ndctl-test1 failed: -6 This is from test/libndctl.c in the ndctl repo: fd = open(bdevpath, O_RDONLY); if (fd < 0) { fprintf(stderr, "%s: failed to open(%s, O_RDONLY)\n", devname, bdevpath); return -ENXIO; } ... ro = 0; rc = ioctl(fd, BLKROSET, &ro); if (rc < 0) { fprintf(stderr, "%s: BLKROSET failed\n", devname); rc = -errno; goto out; } close(fd); fd = open(bdevpath, O_RDWR|O_DIRECT); ... if (write(fd, buf, 4096) < 4096) { fprintf(stderr, "%s: failed to write %s\n", devname, bdevpath); rc = -ENXIO; goto out; } HTH, Jeff
Re: [PATCH] direct-io: Using kmem_cache_zalloc() instead of kmem_cache_alloc() and memset()
Y, Yang, Yang Li writes: > Fix the following coccicheck warning: > ./fs/direct-io.c:1155:7-23: WARNING: kmem_cache_zalloc should be used > for dio, instead of kmem_cache_alloc/memset > > Reported-by: Abaci Robot > Signed-off-by: Yang Li > --- > fs/direct-io.c | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 0957e1b..6ec2935 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -1152,7 +1152,7 @@ static inline int drop_refcount(struct dio *dio) > if (iov_iter_rw(iter) == READ && !count) > return 0; > > - dio = kmem_cache_alloc(dio_cache, GFP_KERNEL); > + dio = kmem_cache_zalloc(dio_cache, GFP_KERNEL); > if (!dio) > return -ENOMEM; > /* > @@ -1160,8 +1160,6 @@ static inline int drop_refcount(struct dio *dio) >* performance regression in a database benchmark. So, we take >* care to only zero out what's needed. >*/ > - memset(dio, 0, offsetof(struct dio, pages)); > - You must have missed the comment just above this memset: /* * Believe it or not, zeroing out the page array caused a .5% * performance regression in a database benchmark. So, we take * care to only zero out what's needed. */ That's referring to this part of the dio struct: /* * pages[] (and any fields placed after it) are not zeroed out at * allocation time. Don't add new fields after pages[] unless you * wish that they not be zeroed. */ union { struct page *pages[DIO_PAGES]; /* page buffer */ struct work_struct complete_work;/* deferred AIO completion */ }; } cacheline_aligned_in_smp; Nacked-by: Jeff Moyer
Re: [PATCH] ACPI: NFIT: Fix input validation of bus-family
Dan Williams writes: > Dan reports that smatch thinks userspace can craft an out-of-bound bus > family number. However, nd_cmd_clear_to_send() blocks all non-zero > values of bus-family since only the kernel can initiate these commands. > However, in the speculation path, family is a user controlled array > index value so mask it for speculation safety. Also, since the > nd_cmd_clear_to_send() safety is non-obvious and possibly may change in > the future include input validation is if userspace could get past the > nd_cmd_clear_to_send() gatekeeper. > > Link: http://lore.kernel.org/r/2020113000.GA1237157@mwanda > Reported-by: Dan Carpenter > Fixes: 6450ddbd5d8e ("ACPI: NFIT: Define runtime firmware activation > commands") > Cc: > Signed-off-by: Dan Williams > --- > drivers/acpi/nfit/core.c |6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > index cda7b6c52504..b11b08a60684 100644 > --- a/drivers/acpi/nfit/core.c > +++ b/drivers/acpi/nfit/core.c > @@ -5,6 +5,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -479,8 +480,11 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, > struct nvdimm *nvdimm, > cmd_mask = nd_desc->cmd_mask; > if (cmd == ND_CMD_CALL && call_pkg->nd_family) { > family = call_pkg->nd_family; > - if (!test_bit(family, &nd_desc->bus_family_mask)) > + if (family > NVDIMM_BUS_FAMILY_MAX || > + !test_bit(family, &nd_desc->bus_family_mask)) > return -EINVAL; > + family = array_index_nospec(family, > + NVDIMM_BUS_FAMILY_MAX + 1); > dsm_mask = acpi_desc->family_dsm_mask[family]; > guid = to_nfit_bus_uuid(family); > } else { Reviewed-by: Jeff Moyer
Re: [PATCH] fs/dax: Fix pmd vs pte conflict detection
Dan Williams writes: > Check for NULL entries before checking the entry order, otherwise NULL > is misinterpreted as a present pte conflict. The 'order' check needs to > happen before the locked check as an unlocked entry at the wrong order > must fallback to lookup the correct order. Please include the user-visible effects of the problem in the changelog. Thanks, Jeff > > Reported-by: Jeff Smits > Reported-by: Doug Nelson > Cc: > Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults") > Cc: Jan Kara > Cc: Matthew Wilcox (Oracle) > Signed-off-by: Dan Williams > --- > fs/dax.c |5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index a71881e77204..08160011d94c 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -221,10 +221,11 @@ static void *get_unlocked_entry(struct xa_state *xas, > unsigned int order) > > for (;;) { > entry = xas_find_conflict(xas); > + if (!entry || WARN_ON_ONCE(!xa_is_value(entry))) > + return entry; > if (dax_entry_order(entry) < order) > return XA_RETRY_ENTRY; > - if (!entry || WARN_ON_ONCE(!xa_is_value(entry)) || > - !dax_is_locked(entry)) > + if (!dax_is_locked(entry)) > return entry; > > wq = dax_entry_waitqueue(xas, entry, &ewait.key);
Re: [PATCH 00/13] nvdimm: Use more common kernel coding style
Joe Perches writes: > Rather than have a local coding style, use the typical kernel style. The coding style isn't that different from the core kernel, and it's still quite readable. I'd rather avoid the churn and the risk of introducing regressions. This will also make backports to stable more of a pain, so it isn't without cost. Dan, is this really something you want to do? -Jeff > > Joe Perches (13): > nvdimm: Use more typical whitespace > nvdimm: Move logical continuations to previous line > nvdimm: Use octal permissions > nvdimm: Use a more common kernel spacing style > nvdimm: Use "unsigned int" in preference to "unsigned" > nvdimm: Add and remove blank lines > nvdimm: Use typical kernel brace styles > nvdimm: Use typical kernel style indentation > nvdimm: btt.h: Neaten #defines to improve readability > nvdimm: namespace_devs: Move assignment operators > nvdimm: Use more common logic testing styles and bare ; positions > nvdimm: namespace_devs: Change progess typo to progress > nvdimm: Miscellaneous neatening > > drivers/nvdimm/badrange.c | 22 +- > drivers/nvdimm/blk.c| 39 ++-- > drivers/nvdimm/btt.c| 249 +++-- > drivers/nvdimm/btt.h| 56 ++--- > drivers/nvdimm/btt_devs.c | 68 +++--- > drivers/nvdimm/bus.c| 138 ++-- > drivers/nvdimm/claim.c | 50 ++--- > drivers/nvdimm/core.c | 42 ++-- > drivers/nvdimm/dax_devs.c | 3 +- > drivers/nvdimm/dimm.c | 3 +- > drivers/nvdimm/dimm_devs.c | 107 - > drivers/nvdimm/e820.c | 2 +- > drivers/nvdimm/label.c | 213 +- > drivers/nvdimm/label.h | 6 +- > drivers/nvdimm/namespace_devs.c | 472 > +--- > drivers/nvdimm/nd-core.h| 31 +-- > drivers/nvdimm/nd.h | 94 > drivers/nvdimm/nd_virtio.c | 20 +- > drivers/nvdimm/of_pmem.c| 6 +- > drivers/nvdimm/pfn_devs.c | 136 ++-- > drivers/nvdimm/pmem.c | 57 ++--- > drivers/nvdimm/pmem.h | 2 +- > drivers/nvdimm/region.c | 20 +- > drivers/nvdimm/region_devs.c| 160 +++--- > drivers/nvdimm/security.c | 138 ++-- > drivers/nvdimm/virtio_pmem.c| 10 +- > 26 files changed, 1115 insertions(+), 1029 deletions(-)
Re: [PATCH v2 0/3] libnvdimm/security: Enumerate the frozen state and other cleanups
Dan Williams writes: > Changes since v1 [1]: > - Cleanup patch1, simplify flags return in the overwrite case and > consolidate frozen-state cases (Jeff) > - Clarify the motivation for patch2 (Jeff) > - Collect Dave's Reviewed-by The series tests out fine for me. Thanks, Dan! -Jeff > > [1]: https://lists.01.org/pipermail/linux-nvdimm/2019-August/023133.html > > --- > > Jeff reported a scenario where ndctl was failing to unlock DIMMs [2]. > Through the course of debug it was discovered that the security > interface on the DIMMs was in the 'frozen' state disallowing unlock, or > any security operation. Unfortunately the kernel only showed that the > DIMMs were 'locked', not 'locked' and 'frozen'. > > Introduce a new sysfs 'frozen' attribute so that ndctl can reflect the > "security-operations-allowed" state independently of the lock status. > Then, followup with cleanups related to replacing a security-state-enum > with a set of flags. > > [2]: https://lists.01.org/pipermail/linux-nvdimm/2019-August/022856.html > > --- > > Dan Williams (3): > libnvdimm/security: Introduce a 'frozen' attribute > libnvdimm/security: Tighten scope of nvdimm->busy vs security operations > libnvdimm/security: Consolidate 'security' operations > > > drivers/acpi/nfit/intel.c| 59 ++- > drivers/nvdimm/bus.c |2 > drivers/nvdimm/dimm_devs.c | 134 ++ > drivers/nvdimm/nd-core.h | 51 -- > drivers/nvdimm/security.c| 199 > +- > include/linux/libnvdimm.h|9 +- > tools/testing/nvdimm/dimm_devs.c | 19 +--- > 7 files changed, 226 insertions(+), 247 deletions(-)
Re: [PATCH v2 2/3] libnvdimm/security: Tighten scope of nvdimm->busy vs security operations
Dan Williams writes: > An attempt to freeze DIMMs currently runs afoul of default blocking of > all security operations in the entry to the 'store' routine for the > 'security' sysfs attribute. > > The blanket blocking of all security operations while the DIMM is in > active use in a region is too restrictive. The only security operations > that need to be aware of the ->busy state are those that mutate the > state of data, i.e. erase and overwrite. > > Refactor the ->busy checks to be applied at the entry common entry point > in __security_store() rather than each of the helper routines to enable > freeze to be run regardless of busy state. > > Reviewed-by: Dave Jiang > Signed-off-by: Dan Williams Reviewed-by: Jeff Moyer > --- > drivers/nvdimm/dimm_devs.c | 33 - > drivers/nvdimm/security.c | 10 -- > 2 files changed, 16 insertions(+), 27 deletions(-) > > diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c > index 53330625fe07..d837cb9be83d 100644 > --- a/drivers/nvdimm/dimm_devs.c > +++ b/drivers/nvdimm/dimm_devs.c > @@ -424,9 +424,6 @@ static ssize_t __security_store(struct device *dev, const > char *buf, size_t len) > unsigned int key, newkey; > int i; > > - if (atomic_read(&nvdimm->busy)) > - return -EBUSY; > - > rc = sscanf(buf, "%"__stringify(SEC_CMD_SIZE)"s" > " %"__stringify(KEY_ID_SIZE)"s" > " %"__stringify(KEY_ID_SIZE)"s", > @@ -451,23 +448,25 @@ static ssize_t __security_store(struct device *dev, > const char *buf, size_t len) > } else if (i == OP_DISABLE) { > dev_dbg(dev, "disable %u\n", key); > rc = nvdimm_security_disable(nvdimm, key); > - } else if (i == OP_UPDATE) { > - dev_dbg(dev, "update %u %u\n", key, newkey); > - rc = nvdimm_security_update(nvdimm, key, newkey, NVDIMM_USER); > - } else if (i == OP_ERASE) { > - dev_dbg(dev, "erase %u\n", key); > - rc = nvdimm_security_erase(nvdimm, key, NVDIMM_USER); > + } else if (i == OP_UPDATE || i == OP_MASTER_UPDATE) { > + dev_dbg(dev, "%s %u %u\n", ops[i].name, key, newkey); > + rc = nvdimm_security_update(nvdimm, key, newkey, i == OP_UPDATE > + ? NVDIMM_USER : NVDIMM_MASTER); > + } else if (i == OP_ERASE || i == OP_MASTER_ERASE) { > + dev_dbg(dev, "%s %u\n", ops[i].name, key); > + if (atomic_read(&nvdimm->busy)) { > + dev_dbg(dev, "Unable to secure erase while DIMM > active.\n"); > + return -EBUSY; > + } > + rc = nvdimm_security_erase(nvdimm, key, i == OP_ERASE > + ? NVDIMM_USER : NVDIMM_MASTER); > } else if (i == OP_OVERWRITE) { > dev_dbg(dev, "overwrite %u\n", key); > + if (atomic_read(&nvdimm->busy)) { > + dev_dbg(dev, "Unable to overwrite while DIMM > active.\n"); > + return -EBUSY; > + } > rc = nvdimm_security_overwrite(nvdimm, key); > - } else if (i == OP_MASTER_UPDATE) { > - dev_dbg(dev, "master_update %u %u\n", key, newkey); > - rc = nvdimm_security_update(nvdimm, key, newkey, > - NVDIMM_MASTER); > - } else if (i == OP_MASTER_ERASE) { > - dev_dbg(dev, "master_erase %u\n", key); > - rc = nvdimm_security_erase(nvdimm, key, > - NVDIMM_MASTER); > } else > return -EINVAL; > > diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c > index 5862d0eee9db..2166e627383a 100644 > --- a/drivers/nvdimm/security.c > +++ b/drivers/nvdimm/security.c > @@ -334,11 +334,6 @@ int nvdimm_security_erase(struct nvdimm *nvdimm, > unsigned int keyid, > || !nvdimm->sec.flags) > return -EOPNOTSUPP; > > - if (atomic_read(&nvdimm->busy)) { > - dev_dbg(dev, "Unable to secure erase while DIMM active.\n"); > - return -EBUSY; > - } > - > rc = check_security_state(nvdimm); > if (rc) > return rc; > @@ -380,11 +375,6 @@ int nvdimm_security_overwrite(struct nvdimm *nvdimm, > unsigned int keyid) > || !nvdimm->sec.flags) > return -EOPNOTSUPP; > > - if (atomic_read(&nvdimm->busy)) { > - dev_dbg(dev, "Unable to overwrite while DIMM active.\n"); > - return -EBUSY; > - } > - > if (dev->driver == NULL) { > dev_dbg(dev, "Unable to overwrite while DIMM active.\n"); > return -EINVAL; > > ___ > Linux-nvdimm mailing list > linux-nvd...@lists.01.org > https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 1/3] libnvdimm/security: Introduce a 'frozen' attribute
Dan Williams writes: > In the process of debugging a system with an NVDIMM that was failing to > unlock it was found that the kernel is reporting 'locked' while the DIMM > security interface is 'frozen'. Unfortunately the security state is > tracked internally as an enum which prevents it from communicating the > difference between 'locked' and 'locked + frozen'. It follows that the > enum also prevents the kernel from communicating 'unlocked + frozen' > which would be useful for debugging why security operations like 'change > passphrase' are disabled. > > Ditch the security state enum for a set of flags and introduce a new > sysfs attribute explicitly for the 'frozen' state. The regression risk > is low because the 'frozen' state was already blocked behind the > 'locked' state, but will need to revisit if there were cases where > applications need 'frozen' to show up in the primary 'security' > attribute. The expectation is that communicating 'frozen' is mostly a > helper for debug and status monitoring. > > Reviewed-by: Dave Jiang > Reported-by: Jeff Moyer > Signed-off-by: Dan Williams Reviewed-by: Jeff Moyer > --- > drivers/acpi/nfit/intel.c| 59 --- > drivers/nvdimm/bus.c |2 - > drivers/nvdimm/dimm_devs.c | 59 +-- > drivers/nvdimm/nd-core.h | 21 ++-- > drivers/nvdimm/security.c| 99 > ++ > include/linux/libnvdimm.h|9 ++- > tools/testing/nvdimm/dimm_devs.c | 19 ++- > 7 files changed, 141 insertions(+), 127 deletions(-) > > diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c > index cddd0fcf622c..1113b679cd7b 100644 > --- a/drivers/acpi/nfit/intel.c > +++ b/drivers/acpi/nfit/intel.c > @@ -7,10 +7,11 @@ > #include "intel.h" > #include "nfit.h" > > -static enum nvdimm_security_state intel_security_state(struct nvdimm *nvdimm, > +static unsigned long intel_security_flags(struct nvdimm *nvdimm, > enum nvdimm_passphrase_type ptype) > { > struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm); > + unsigned long security_flags = 0; > struct { > struct nd_cmd_pkg pkg; > struct nd_intel_get_security_state cmd; > @@ -27,7 +28,7 @@ static enum nvdimm_security_state > intel_security_state(struct nvdimm *nvdimm, > int rc; > > if (!test_bit(NVDIMM_INTEL_GET_SECURITY_STATE, &nfit_mem->dsm_mask)) > - return -ENXIO; > + return 0; > > /* >* Short circuit the state retrieval while we are doing overwrite. > @@ -35,38 +36,42 @@ static enum nvdimm_security_state > intel_security_state(struct nvdimm *nvdimm, >* until the overwrite DSM completes. >*/ > if (nvdimm_in_overwrite(nvdimm) && ptype == NVDIMM_USER) > - return NVDIMM_SECURITY_OVERWRITE; > + return BIT(NVDIMM_SECURITY_OVERWRITE); > > rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL); > - if (rc < 0) > - return rc; > - if (nd_cmd.cmd.status) > - return -EIO; > + if (rc < 0 || nd_cmd.cmd.status) { > + pr_err("%s: security state retrieval failed (%d:%#x)\n", > + nvdimm_name(nvdimm), rc, nd_cmd.cmd.status); > + return 0; > + } > > /* check and see if security is enabled and locked */ > if (ptype == NVDIMM_MASTER) { > if (nd_cmd.cmd.extended_state & ND_INTEL_SEC_ESTATE_ENABLED) > - return NVDIMM_SECURITY_UNLOCKED; > - else if (nd_cmd.cmd.extended_state & > - ND_INTEL_SEC_ESTATE_PLIMIT) > - return NVDIMM_SECURITY_FROZEN; > - } else { > - if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_UNSUPPORTED) > - return -ENXIO; > - else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_ENABLED) { > - if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_LOCKED) > - return NVDIMM_SECURITY_LOCKED; > - else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_FROZEN > - || nd_cmd.cmd.state & > - ND_INTEL_SEC_STATE_PLIMIT) > - return NVDIMM_SECURITY_FROZEN; > - else > - return NVDIMM_SECURITY_UNLOCKED; > -
Re: [PATCH 2/2] drivers/dax/kmem: give a warning if CONFIG_DEV_DAX_PMEM_COMPAT is enabled
Jia He writes: > commit c221c0b0308f ("device-dax: "Hotplug" persistent memory for use > like normal RAM") helps to add persistent memory as normal RAM blocks. > But this driver doesn't work if CONFIG_DEV_DAX_PMEM_COMPAT is enabled. > > Here is the debugging call trace when CONFIG_DEV_DAX_PMEM_COMPAT is > enabled. > [4.443730] devm_memremap_pages+0x4b9/0x540 > [4.443733] dev_dax_probe+0x112/0x220 [device_dax] > [4.443735] dax_pmem_compat_probe+0x58/0x92 [dax_pmem_compat] > [4.443737] nvdimm_bus_probe+0x6b/0x150 > [4.443739] really_probe+0xf5/0x3d0 > [4.443740] driver_probe_device+0x11b/0x130 > [4.443741] device_driver_attach+0x58/0x60 > [4.443742] __driver_attach+0xa3/0x140 > > Then the dax0.0 device will be registered as "nd" bus instead of > "dax" bus. This causes the error as follows: > root@ubuntu:~# echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind > -bash: echo: write error: No such device > > This gives a warning to notify the user. > > Signed-off-by: Jia He > --- > drivers/dax/kmem.c | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index ad62d551d94e..b77f0e880598 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -93,6 +93,11 @@ static struct dax_device_driver device_dax_kmem_driver = { > > static int __init dax_kmem_init(void) > { > + if (IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)) { > + pr_warn("CONFIG_DEV_DAX_PMEM_COMPAT is not compatible\n"); > + pr_warn("kmem dax driver might not be workable\n"); > + } > + > return dax_driver_register(&device_dax_kmem_driver); > } This logic is wrong (and the error message is *very* confusing). You can have the driver configured, but not loaded. In that case, the kmem driver will load and work properly. When using daxctl to online memory, you already get the following message: libdaxctl: daxctl_dev_disable: dax0.0: error: device model is dax-class That's still not very helpful. It would be better if the message suggested a fix (like using migrate-device-model). Vishal? Cheers, Jeff
Re: [PATCH] aio: Fix io_pgetevents() struct __compat_aio_sigset layout
Guillem Jover writes: > This type is used to pass the sigset_t from userland to the kernel, > but it was using the kernel native pointer type for the member > representing the compat userland pointer to the userland sigset_t. > > This messes up the layout, and makes the kernel eat up both the > userland pointer and the size members into the kernel pointer, and > then reads garbage into the kernel sigsetsize. Which makes the sigset_t > size consistency check fail, and consequently the syscall always > returns -EINVAL. > > This breaks both libaio and strace on 32-bit userland running on 64-bit > kernels. And there are apparently no users in the wild of the current > broken layout (at least according to codesearch.debian.org and a brief > check over github.com search). So it looks safe to fix this directly > in the kernel, instead of either letting userland deal with this > permanently with the additional overhead or trying to make the syscall > infer what layout userland used, even though this is also being worked > around in libaio to temporarily cope with kernels that have not yet > been fixed. > > We use a proper compat_uptr_t instead of a compat_sigset_t pointer. > > Fixes: 7a074e96 ("aio: implement io_pgetevents") > Signed-off-by: Guillem Jover Looks good, thanks for finding and fixing this! Reviewed-by: Jeff Moyer > --- > fs/aio.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/fs/aio.c b/fs/aio.c > index 01e0fb9ae45a..056f291bc66f 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -2179,7 +2179,7 @@ SYSCALL_DEFINE5(io_getevents_time32, __u32, ctx_id, > #ifdef CONFIG_COMPAT > > struct __compat_aio_sigset { > - compat_sigset_t __user *sigmask; > + compat_uptr_t sigmask; > compat_size_t sigsetsize; > }; > > @@ -2204,7 +2204,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents, > if (usig && copy_from_user(&ksig, usig, sizeof(ksig))) > return -EFAULT; > > - ret = set_compat_user_sigmask(ksig.sigmask, ksig.sigsetsize); > + ret = set_compat_user_sigmask(compat_ptr(ksig.sigmask), > ksig.sigsetsize); > if (ret) > return ret; > > @@ -2239,7 +2239,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64, > if (usig && copy_from_user(&ksig, usig, sizeof(ksig))) > return -EFAULT; > > - ret = set_compat_user_sigmask(ksig.sigmask, ksig.sigsetsize); > + ret = set_compat_user_sigmask(compat_ptr(ksig.sigmask), > ksig.sigsetsize); > if (ret) > return ret;
Re: [PATCH] aio: remove redundant assignment to variable ret
Colin King writes: > From: Colin Ian King > > The variable ret is being set to -EINVAL however this is never read > and later it is being reassigned to a new value. The assignment is > redundant and hence can be removed. > > Addresses-Coverity: ("Unused Value") > Signed-off-by: Colin Ian King > --- > fs/aio.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/fs/aio.c b/fs/aio.c > index f9f441b59966..3e290dfac10a 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1528,7 +1528,6 @@ static int aio_read(struct kiocb *req, const struct > iocb *iocb, > file = req->ki_filp; > if (unlikely(!(file->f_mode & FMODE_READ))) > return -EBADF; > - ret = -EINVAL; > if (unlikely(!file->f_op->read_iter)) > return -EINVAL; Acked-by: Jeff Moyer
Re: [PATCH 2/3] libnvdimm/security: Tighten scope of nvdimm->busy vs security operations
Dan Williams writes: > On Fri, Aug 16, 2019 at 1:49 PM Jeff Moyer wrote: >> >> Dan Williams writes: >> >> > The blanket blocking of all security operations while the DIMM is in >> > active use in a region is too restrictive. The only security operations >> > that need to be aware of the ->busy state are those that mutate the >> > state of data, i.e. erase and overwrite. >> > >> > Refactor the ->busy checks to be applied at the entry common entry point >> > in __security_store() rather than each of the helper routines. >> >> I'm not sure this buys you much. Did you test this on actual hardware >> to make sure your assumptions are correct? I guess the worst case is we >> get an "invalid security state" error back from the firmware >> >> Still, what's the motivation for this? > > The motivation was when I went to test setting the frozen state and > found that it complained about the DIMM being active. There's nothing > wrong with freezing security while the DIMM is mapped. ...but then I > somehow managed to write this generalized commit message that left out > the explicit failure I was worried about. Yes, moved too fast, but the > motivation is "allow freeze while active" and centralize the ->busy > check so it's just one function to review that common constraint. OK, thanks for the info. Reviewed-by: Jeff Moyer
Re: [PATCH 3/3] libnvdimm/security: Consolidate 'security' operations
Dan Williams writes: > The security operations are exported from libnvdimm/security.c to > libnvdimm/dimm_devs.c, and libnvdimm/security.c is optionally compiled > based on the CONFIG_NVDIMM_KEYS config symbol. > > Rather than export the operations across compile objects, just move the > __security_store() entry point to live with the helpers. > > Cc: Dave Jiang > Signed-off-by: Dan Williams Fine by me. Acked-by: Jeff Moyer
Re: [PATCH 2/3] libnvdimm/security: Tighten scope of nvdimm->busy vs security operations
Dan Williams writes: > The blanket blocking of all security operations while the DIMM is in > active use in a region is too restrictive. The only security operations > that need to be aware of the ->busy state are those that mutate the > state of data, i.e. erase and overwrite. > > Refactor the ->busy checks to be applied at the entry common entry point > in __security_store() rather than each of the helper routines. I'm not sure this buys you much. Did you test this on actual hardware to make sure your assumptions are correct? I guess the worst case is we get an "invalid security state" error back from the firmware Still, what's the motivation for this? -Jeff > > Cc: Dave Jiang > Signed-off-by: Dan Williams > --- > drivers/nvdimm/dimm_devs.c | 33 - > drivers/nvdimm/security.c | 10 -- > 2 files changed, 16 insertions(+), 27 deletions(-) > > diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c > index 53330625fe07..d837cb9be83d 100644 > --- a/drivers/nvdimm/dimm_devs.c > +++ b/drivers/nvdimm/dimm_devs.c > @@ -424,9 +424,6 @@ static ssize_t __security_store(struct device *dev, const > char *buf, size_t len) > unsigned int key, newkey; > int i; > > - if (atomic_read(&nvdimm->busy)) > - return -EBUSY; > - > rc = sscanf(buf, "%"__stringify(SEC_CMD_SIZE)"s" > " %"__stringify(KEY_ID_SIZE)"s" > " %"__stringify(KEY_ID_SIZE)"s", > @@ -451,23 +448,25 @@ static ssize_t __security_store(struct device *dev, > const char *buf, size_t len) > } else if (i == OP_DISABLE) { > dev_dbg(dev, "disable %u\n", key); > rc = nvdimm_security_disable(nvdimm, key); > - } else if (i == OP_UPDATE) { > - dev_dbg(dev, "update %u %u\n", key, newkey); > - rc = nvdimm_security_update(nvdimm, key, newkey, NVDIMM_USER); > - } else if (i == OP_ERASE) { > - dev_dbg(dev, "erase %u\n", key); > - rc = nvdimm_security_erase(nvdimm, key, NVDIMM_USER); > + } else if (i == OP_UPDATE || i == OP_MASTER_UPDATE) { > + dev_dbg(dev, "%s %u %u\n", ops[i].name, key, newkey); > + rc = nvdimm_security_update(nvdimm, key, newkey, i == OP_UPDATE > + ? NVDIMM_USER : NVDIMM_MASTER); > + } else if (i == OP_ERASE || i == OP_MASTER_ERASE) { > + dev_dbg(dev, "%s %u\n", ops[i].name, key); > + if (atomic_read(&nvdimm->busy)) { > + dev_dbg(dev, "Unable to secure erase while DIMM > active.\n"); > + return -EBUSY; > + } > + rc = nvdimm_security_erase(nvdimm, key, i == OP_ERASE > + ? NVDIMM_USER : NVDIMM_MASTER); > } else if (i == OP_OVERWRITE) { > dev_dbg(dev, "overwrite %u\n", key); > + if (atomic_read(&nvdimm->busy)) { > + dev_dbg(dev, "Unable to overwrite while DIMM > active.\n"); > + return -EBUSY; > + } > rc = nvdimm_security_overwrite(nvdimm, key); > - } else if (i == OP_MASTER_UPDATE) { > - dev_dbg(dev, "master_update %u %u\n", key, newkey); > - rc = nvdimm_security_update(nvdimm, key, newkey, > - NVDIMM_MASTER); > - } else if (i == OP_MASTER_ERASE) { > - dev_dbg(dev, "master_erase %u\n", key); > - rc = nvdimm_security_erase(nvdimm, key, > - NVDIMM_MASTER); > } else > return -EINVAL; > > diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c > index 5862d0eee9db..2166e627383a 100644 > --- a/drivers/nvdimm/security.c > +++ b/drivers/nvdimm/security.c > @@ -334,11 +334,6 @@ int nvdimm_security_erase(struct nvdimm *nvdimm, > unsigned int keyid, > || !nvdimm->sec.flags) > return -EOPNOTSUPP; > > - if (atomic_read(&nvdimm->busy)) { > - dev_dbg(dev, "Unable to secure erase while DIMM active.\n"); > - return -EBUSY; > - } > - > rc = check_security_state(nvdimm); > if (rc) > return rc; > @@ -380,11 +375,6 @@ int nvdimm_security_overwrite(struct nvdimm *nvdimm, > unsigned int keyid) > || !nvdimm->sec.flags) > return -EOPNOTSUPP; > > - if (atomic_read(&nvdimm->busy)) { > - dev_dbg(dev, "Unable to overwrite while DIMM active.\n"); > - return -EBUSY; > - } > - > if (dev->driver == NULL) { > dev_dbg(dev, "Unable to overwrite while DIMM active.\n"); > return -EINVAL; > > ___ > Linux-nvdimm mailing list > linux-nvd...@lists.01.org > https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 1/3] libnvdimm/security: Introduce a 'frozen' attribute
Dan Williams writes: > In the process of debugging a system with an NVDIMM that was failing to > unlock it was found that the kernel is reporting 'locked' while the DIMM > security interface is 'frozen'. Unfortunately the security state is > tracked internally as an enum which prevents it from communicating the > difference between 'locked' and 'locked + frozen'. It follows that the > enum also prevents the kernel from communicating 'unlocked + frozen' > which would be useful for debugging why security operations like 'change > passphrase' are disabled. > > Ditch the security state enum for a set of flags and introduce a new > sysfs attribute explicitly for the 'frozen' state. The regression risk > is low because the 'frozen' state was already blocked behind the > 'locked' state, but will need to revisit if there were cases where > applications need 'frozen' to show up in the primary 'security' > attribute. The expectation is that communicating 'frozen' is mostly a > helper for debug and status monitoring. > > Cc: Dave Jiang > Reported-by: Jeff Moyer > Signed-off-by: Dan Williams > --- > drivers/acpi/nfit/intel.c| 65 ++--- > drivers/nvdimm/bus.c |2 - > drivers/nvdimm/dimm_devs.c | 59 +-- > drivers/nvdimm/nd-core.h | 21 ++-- > drivers/nvdimm/security.c| 99 > ++ > include/linux/libnvdimm.h|9 ++- > tools/testing/nvdimm/dimm_devs.c | 19 ++- > 7 files changed, 146 insertions(+), 128 deletions(-) > > diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c > index cddd0fcf622c..2c51ca4155dc 100644 > --- a/drivers/acpi/nfit/intel.c > +++ b/drivers/acpi/nfit/intel.c > @@ -7,10 +7,11 @@ > #include "intel.h" > #include "nfit.h" > > -static enum nvdimm_security_state intel_security_state(struct nvdimm *nvdimm, > +static unsigned long intel_security_flags(struct nvdimm *nvdimm, > enum nvdimm_passphrase_type ptype) > { > struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm); > + unsigned long security_flags = 0; > struct { > struct nd_cmd_pkg pkg; > struct nd_intel_get_security_state cmd; > @@ -27,46 +28,54 @@ static enum nvdimm_security_state > intel_security_state(struct nvdimm *nvdimm, > int rc; > > if (!test_bit(NVDIMM_INTEL_GET_SECURITY_STATE, &nfit_mem->dsm_mask)) > - return -ENXIO; > + return 0; > > /* >* Short circuit the state retrieval while we are doing overwrite. >* The DSM spec states that the security state is indeterminate >* until the overwrite DSM completes. >*/ > - if (nvdimm_in_overwrite(nvdimm) && ptype == NVDIMM_USER) > - return NVDIMM_SECURITY_OVERWRITE; > + if (nvdimm_in_overwrite(nvdimm) && ptype == NVDIMM_USER) { > + set_bit(NVDIMM_SECURITY_OVERWRITE, &security_flags); > + return security_flags; > + } Why not just return BIT(NVDIMM_SECURITY_OVERWRITE); ? > rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL); > - if (rc < 0) > - return rc; > - if (nd_cmd.cmd.status) > - return -EIO; > + if (rc < 0 || nd_cmd.cmd.status) { > + pr_err("%s: security state retrieval failed (%d:%#x)\n", > + nvdimm_name(nvdimm), rc, nd_cmd.cmd.status); > + return 0; > + } > > /* check and see if security is enabled and locked */ > if (ptype == NVDIMM_MASTER) { > if (nd_cmd.cmd.extended_state & ND_INTEL_SEC_ESTATE_ENABLED) > - return NVDIMM_SECURITY_UNLOCKED; > - else if (nd_cmd.cmd.extended_state & > - ND_INTEL_SEC_ESTATE_PLIMIT) > - return NVDIMM_SECURITY_FROZEN; > - } else { > - if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_UNSUPPORTED) > - return -ENXIO; > - else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_ENABLED) { > - if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_LOCKED) > - return NVDIMM_SECURITY_LOCKED; > - else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_FROZEN > - || nd_cmd.cmd.state & > - ND_INTEL_SEC_STATE_PLIMIT) > - ret
Re: [PATCH] aio: add timeout validity check for io_[p]getevents
Al Viro writes: > On Mon, Jul 29, 2019 at 10:57:41AM -0400, Jeff Moyer wrote: >> Al, can you take this through your tree? > > Umm... Can do, but I had an impression that Arnd and Deepa > had a tree for timespec-related work. OTOH, it had been > relatively quiet last cycle, so... If they have nothing > in the area, I can take it through vfs.git. Hmm, okay. Yi, can you repost the patch, adding my Reviewed-by tag, and CC-ing Arnd and Deepa: Arnd Bergmann Deepa Dinamani Thanks! Jeff
Re: [PATCH] aio: add timeout validity check for io_[p]getevents
Al, can you take this through your tree? Thanks, Jeff Jeff Moyer writes: > "zhangyi (F)" writes: > >> io_[p]getevents syscall should return -EINVAL if if timeout is out of >> range, add this validity check. >> >> Signed-off-by: zhangyi (F) >> --- >> fs/aio.c | 11 +-- >> 1 file changed, 9 insertions(+), 2 deletions(-) >> >> diff --git a/fs/aio.c b/fs/aio.c >> index 01e0fb9..dd967a0 100644 >> --- a/fs/aio.c >> +++ b/fs/aio.c >> @@ -2031,10 +2031,17 @@ static long do_io_getevents(aio_context_t ctx_id, >> struct io_event __user *events, >> struct timespec64 *ts) >> { >> -ktime_t until = ts ? timespec64_to_ktime(*ts) : KTIME_MAX; >> -struct kioctx *ioctx = lookup_ioctx(ctx_id); >> +ktime_t until = KTIME_MAX; >> +struct kioctx *ioctx = NULL; >> long ret = -EINVAL; >> >> +if (ts) { >> +if (!timespec64_valid(ts)) >> +return ret; >> +until = timespec64_to_ktime(*ts); >> +} >> + >> +ioctx = lookup_ioctx(ctx_id); >> if (likely(ioctx)) { >> if (likely(min_nr <= nr && min_nr >= 0)) >> ret = read_events(ioctx, min_nr, nr, events, until); > > Reviewed-by: Jeff Moyer > > The previous suggestion[1] of fixing the helpers never materialized, so > let's just get this fixed, already. > > -Jeff > > [1] https://marc.info/?l=linux-fsdevel&m=152209450618587&w=2 > > -- > To unsubscribe, send a message with 'unsubscribe linux-aio' in > the body to majord...@kvack.org. For more info on Linux AIO, > see: http://www.kvack.org/aio/ > Don't email: mailto:"a...@kvack.org";>a...@kvack.org
Re: [PATCH] aio: add timeout validity check for io_[p]getevents
"zhangyi (F)" writes: > io_[p]getevents syscall should return -EINVAL if if timeout is out of > range, add this validity check. > > Signed-off-by: zhangyi (F) > --- > fs/aio.c | 11 +-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/fs/aio.c b/fs/aio.c > index 01e0fb9..dd967a0 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -2031,10 +2031,17 @@ static long do_io_getevents(aio_context_t ctx_id, > struct io_event __user *events, > struct timespec64 *ts) > { > - ktime_t until = ts ? timespec64_to_ktime(*ts) : KTIME_MAX; > - struct kioctx *ioctx = lookup_ioctx(ctx_id); > + ktime_t until = KTIME_MAX; > + struct kioctx *ioctx = NULL; > long ret = -EINVAL; > > + if (ts) { > + if (!timespec64_valid(ts)) > + return ret; > + until = timespec64_to_ktime(*ts); > + } > + > + ioctx = lookup_ioctx(ctx_id); > if (likely(ioctx)) { > if (likely(min_nr <= nr && min_nr >= 0)) > ret = read_events(ioctx, min_nr, nr, events, until); Reviewed-by: Jeff Moyer The previous suggestion[1] of fixing the helpers never materialized, so let's just get this fixed, already. -Jeff [1] https://marc.info/?l=linux-fsdevel&m=152209450618587&w=2
Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()
"Aneesh Kumar K.V" writes: > On 6/14/19 10:06 PM, Dan Williams wrote: >> On Fri, Jun 14, 2019 at 9:26 AM Aneesh Kumar K.V >> wrote: > >>> Why not let the arch >>> arch decide the SUBSECTION_SHIFT and default to one subsection per >>> section if arch is not enabled to work with subsection. >> >> Because that keeps the implementation from ever reaching a point where >> a namespace might be able to be moved from one arch to another. If we >> can squash these arch differences then we can have a common tool to >> initialize namespaces outside of the kernel. The one wrinkle is >> device-dax that wants to enforce the mapping size, > > The fsdax have a much bigger issue right? The file system block size > is the same as PAGE_SIZE and we can't make it portable across archs > that support different PAGE_SIZE? File system blocks are not tied to page size. They can't be *bigger* than the page size currently, but they can be smaller. Still, I don't see that as an arugment against trying to make the namespaces work across architectures. Consider a user who only has sector mode namespaces. We'd like that to work if at all possible. -Jeff
Re: [PATCH 7/7] libnvdimm/pfn: Fix 'start_pad' implementation
Dan Williams writes: >> Great! Now let's create another one. >> >> # ndctl create-namespace -m fsdax -s 132m >> libndctl: ndctl_pfn_enable: pfn1.1: failed to enable >> Error: namespace1.2: failed to enable >> >> failed to create namespace: No such device or address >> >> (along with a kernel warning spew) > > I assume you're seeing this on the libnvdimm-pending branch? Yes, but also on linus' master branch. Things have been operating in this manner for some time. >> I understand the desire for expediency. At some point, though, we have >> to address the root of the problem. > > Well, you've defibrillated me back to reality. We've suffered the > incomplete broken hacks for 2 years, what's another 10 weeks? I'll > dust off the sub-section patches and take another run at it. OK, thanks. Let me know if I can help at all. Cheers, Jeff
Re: [PATCH 7/7] libnvdimm/pfn: Fix 'start_pad' implementation
Dan Williams writes: >> > However, to fix this situation a non-backwards compatible change >> > needs to be made to the interpretation of the nd_pfn info-block. >> > ->start_pad needs to be accounted in ->map.map_offset (formerly >> > ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be >> > adjusted to the section aligned resource base used to establish >> > ->map.map formerly (formerly ->virt_addr). >> > >> > The guiding principles of the info-block compatibility fixup is to >> > maintain the interpretation of ->data_offset for implementations like >> > the EFI driver that only care about data_access not dax, but cause older >> > Linux implementations that care about the mode and dax to fail to parse >> > the new info-block. >> >> What if the core mm grew support for hotplug on sub-section boundaries? >> Would't that fix this problem (and others)? > > Yes, I think it would, and I had patches along these lines [2]. Last > time I looked at this I was asked by core-mm folks to await some > general refactoring of hotplug [3], and I wasn't proud about some of > the hacks I used to make it work. In general I'm less confident about > being able to get sub-section-hotplug over the goal line (core-mm > resistance to hotplug complexity) vs the local hacks in nvdimm to deal > with this breakage. You first posted that patch series in December of 2016. How long do we wait for this refactoring to happen? Meanwhile, we've been kicking this can down the road for far too long. Simple namespace creation fails to work. For example: # ndctl create-namespace -m fsdax -s 128m Error: '--size=' must align to interleave-width: 6 and alignment: 2097152 did you intend --size=132M? failed to create namespace: Invalid argument ok, I can't actually create a small, section-aligned namespace. Let's bump it up: # ndctl create-namespace -m fsdax -s 132m { "dev":"namespace1.0", "mode":"fsdax", "map":"dev", "size":"126.00 MiB (132.12 MB)", "uuid":"2a5f8fe0-69e2-46bf-98bc-0f5667cd810a", "raw_uuid":"f7324317-5cd2-491e-8cd1-ad03770593f2", "sector_size":512, "blockdev":"pmem1", "numa_node":1 } Great! Now let's create another one. # ndctl create-namespace -m fsdax -s 132m libndctl: ndctl_pfn_enable: pfn1.1: failed to enable Error: namespace1.2: failed to enable failed to create namespace: No such device or address (along with a kernel warning spew) And at this point, all further ndctl create-namespace commands fail. Lovely. This is a wart that was acceptable only because a fix was coming. 2+ years later, and we're still adding hacks to work around it (and there have been *several* hacks). > Local hacks are always a sad choice, but I think leaving these > configurations stranded for another kernel cycle is not tenable. It > wasn't until the github issue did I realize that the problem was > happening in the wild on NVDIMM-N platforms. I understand the desire for expediency. At some point, though, we have to address the root of the problem. -Jeff > > [2]: > https://lore.kernel.org/lkml/148964440651.19438.2288075389153762985.st...@dwillia2-desk3.amr.corp.intel.com/ > [3]: https://lore.kernel.org/lkml/20170319163531.ga25...@dhcp22.suse.cz/ > >> >> -Jeff >> >> [1] https://github.com/pmem/ndctl/issues/76
Re: [PATCH 7/7] libnvdimm/pfn: Fix 'start_pad' implementation
Hi, Dan, Thanks for the comprehensive write-up. Comments below. Dan Williams writes: > In the beginning the pmem driver simply passed the persistent memory > resource range to memremap and was done. With the introduction of > devm_memremap_pages() and vmem_altmap the implementation needed to > contend with metadata at the start of the resource to indicate whether > the vmemmap is located in System RAM or Persistent Memory, and reserve > vmemmap capacity in pmem for the latter case. > > The indication of metadata space was communicated in the > nd_pfn->data_offset property and it was defined to be identical to the > pmem_device->data_offset property, i.e. relative to the raw resource > base of the namespace. Up until this point in the driver's development > pmem_device->phys_addr == __pa(pmem_device->virt_addr). This > implementation was fine up until the discovery of platforms with > physical address layouts that mapped Persistent Memory and System RAM to > the same Linux memory hotplug section (128MB span). > > The nd_pfn->start_pad and nd_pfn->end_trunc properties were introduced > to pad and truncate the capacity to fit within an exclusive Linux > memory hotplug section span, and it was at this point where the > ->start_pad definition did not comprehend the pmem_device->phys_addr to > pmem_device->virt_addr relationship. Platforms in the wild typically > only collided 'System RAM' at the end of the Persistent Memory range so > ->start_pad was often zero. > > Lately Linux has encountered platforms that collide Persistent Memory > regions between each other, specifically cases where ->start_pad needed > to be non-zero. This lead to commit ae86cbfef381 "libnvdimm, pfn: Pad > pfn namespaces relative to other regions". That commit allowed > namespaces to be mapped with devm_memremap_pages(). However dax > operations on those configurations currently fail if attempted within the > ->start_pad range because pmem_device->data_offset was still relative to > raw resource base not relative to the section aligned resource range > mapped by devm_memremap_pages(). > > Luckily __bdev_dax_supported() caught these failures and simply disabled > dax. Let me make sure I understand the current state of things. Assume a machine with two persistent memory ranges overlapping the same hotplug memory section. Let's take the example from the ndctl github issue[1]: 187c00-967bff : Persistent Memory /sys/bus/nd/devices/region0/resource: 0x187c00 /sys/bus/nd/devices/region1/resource: 0x577c00 Create a namespace in region1. That namespace will have a start_pad of 64MiB. The problem is that, while the correct offset was specified when laying out the struct pages (via arch_add_memory), the data_offset for the pmem block device itself does not take the start_pad into account (despite the comment in the nd_pfn_sb data structure!). As a result, the block device starts at the beginning of the address range, but struct pages only exist for the address space starting 64MiB into the range. __bdev_dax_supported fails, because it tries to perform a direct_access call on sector 0, and there's no pgmap for the address corresponding to that sector. So, we can't simply make the code correct (by adding the start_pad to pmem->data_offset) without bumping the superblock version, because that would change the size of the block device, and the location of data on that block device would all be off by 64MiB (and you'd lose the first 64MiB). Mass hysteria. > However, to fix this situation a non-backwards compatible change > needs to be made to the interpretation of the nd_pfn info-block. > ->start_pad needs to be accounted in ->map.map_offset (formerly > ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be > adjusted to the section aligned resource base used to establish > ->map.map formerly (formerly ->virt_addr). > > The guiding principles of the info-block compatibility fixup is to > maintain the interpretation of ->data_offset for implementations like > the EFI driver that only care about data_access not dax, but cause older > Linux implementations that care about the mode and dax to fail to parse > the new info-block. What if the core mm grew support for hotplug on sub-section boundaries? Would't that fix this problem (and others)? -Jeff [1] https://github.com/pmem/ndctl/issues/76
Re: [PATCH 0/7] libnvdimm/pfn: Fix section-alignment padding
Dan Williams writes: > On Tue, Feb 12, 2019 at 1:37 PM Dan Williams wrote: >> >> Lately Linux has encountered platforms that collide Persistent Memory >> regions between each other, specifically cases where ->start_pad needed >> to be non-zero. This lead to commit ae86cbfef381 "libnvdimm, pfn: Pad >> pfn namespaces relative to other regions". That commit allowed >> namespaces to be mapped with devm_memremap_pages(). However dax >> operations on those configurations currently fail if attempted within the >> ->start_pad range because pmem_device->data_offset was still relative to >> raw resource base not relative to the section aligned resource range >> mapped by devm_memremap_pages(). >> >> Luckily __bdev_dax_supported() caught these failures and simply disabled >> dax. However, to fix this situation a non-backwards compatible change >> needs to be made to the interpretation of the nd_pfn info-block. >> ->start_pad needs to be accounted in ->map.map_offset (formerly >> ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be >> adjusted to the section aligned resource base used to establish >> ->map.map formerly (formerly ->virt_addr). >> >> See patch 7 "libnvdimm/pfn: Fix 'start_pad' implementation" for more >> details, and the ndctl patch series "Improve support + testing for >> labels + info-blocks" for the corresponding regression test. > > Hello valued reviewers, can I plead for a sanity check of at least > "libnvdimm/pfn: Introduce super-block minimum version requirements" > and "libnvdimm/pfn: Fix 'start_pad' implementation"? In particular > Jeff / Johannes this has end user / distro impact in that users may > lose access to namespaces that are upgraded to v1.3 info-blocks and > then boot an old kernel. I did not see a way around that sharp edge. Yes, I'll take a look. Cheers, Jeff
Re: [PATCH] acpi/nfit: Fix bus command validation
Dan Williams writes: > Commit 11189c1089da "acpi/nfit: Fix command-supported detection" broke > ND_CMD_CALL for bus-level commands. The "func = cmd" assumption is only > valid for: > > ND_CMD_ARS_CAP > ND_CMD_ARS_START > ND_CMD_ARS_STATUS > ND_CMD_CLEAR_ERROR > > The function number otherwise needs to be pulled from the command > payload for: > > NFIT_CMD_TRANSLATE_SPA > NFIT_CMD_ARS_INJECT_SET > NFIT_CMD_ARS_INJECT_CLEAR > NFIT_CMD_ARS_INJECT_GET > > Update cmd_to_func() for the bus case and call it in the common path. > > Fixes: 11189c1089da ("acpi/nfit: Fix command-supported detection") > Cc: > Cc: Vishal Verma > Reported-by: Grzegorz Burzynski > Signed-off-by: Dan Williams Tricky code path, eh? Tested-by: Jeff Moyer -Jeff > --- > drivers/acpi/nfit/core.c | 22 -- > 1 file changed, 12 insertions(+), 10 deletions(-) > > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > index e18ade5d74e9..c34c595d6bb0 100644 > --- a/drivers/acpi/nfit/core.c > +++ b/drivers/acpi/nfit/core.c > @@ -415,7 +415,7 @@ static int cmd_to_func(struct nfit_mem *nfit_mem, > unsigned int cmd, > if (call_pkg) { > int i; > > - if (nfit_mem->family != call_pkg->nd_family) > + if (nfit_mem && nfit_mem->family != call_pkg->nd_family) > return -ENOTTY; > > for (i = 0; i < ARRAY_SIZE(call_pkg->nd_reserved2); i++) > @@ -424,6 +424,10 @@ static int cmd_to_func(struct nfit_mem *nfit_mem, > unsigned int cmd, > return call_pkg->nd_command; > } > > + /* In the !call_pkg case, bus commands == bus functions */ > + if (!nfit_mem) > + return cmd; > + > /* Linux ND commands == NVDIMM_FAMILY_INTEL function numbers */ > if (nfit_mem->family == NVDIMM_FAMILY_INTEL) > return cmd; > @@ -454,17 +458,18 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor > *nd_desc, struct nvdimm *nvdimm, > if (cmd_rc) > *cmd_rc = -EINVAL; > > + if (cmd == ND_CMD_CALL) > + call_pkg = buf; > + func = cmd_to_func(nfit_mem, cmd, call_pkg); > + if (func < 0) > + return func; > + > if (nvdimm) { > struct acpi_device *adev = nfit_mem->adev; > > if (!adev) > return -ENOTTY; > > - if (cmd == ND_CMD_CALL) > - call_pkg = buf; > - func = cmd_to_func(nfit_mem, cmd, call_pkg); > - if (func < 0) > - return func; > dimm_name = nvdimm_name(nvdimm); > cmd_name = nvdimm_cmd_name(cmd); > cmd_mask = nvdimm_cmd_mask(nvdimm); > @@ -475,12 +480,9 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, > struct nvdimm *nvdimm, > } else { > struct acpi_device *adev = to_acpi_dev(acpi_desc); > > - func = cmd; > cmd_name = nvdimm_bus_cmd_name(cmd); > cmd_mask = nd_desc->cmd_mask; > - dsm_mask = cmd_mask; > - if (cmd == ND_CMD_CALL) > - dsm_mask = nd_desc->bus_dsm_mask; > + dsm_mask = nd_desc->bus_dsm_mask; > desc = nd_cmd_bus_desc(cmd); > guid = to_nfit_uuid(NFIT_DEV_BUS); > handle = adev->handle; > > ___ > Linux-nvdimm mailing list > linux-nvd...@lists.01.org > https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [RFC PATCH] aio: add check for timeout to aviod invalid value
Tan Xiaojun writes: > (When I was testing with syzkaller, I found a lot of ubsan problems. Here > is one of them. I am not sure if it needs to be fixed and how it will be > fixed. So I sent this patch to ask your opinion.) > > Syzkaller reported a UBSAN bug below, which was mainly caused by a large > negative number passed to the timeout of the io_getevents system call. > > > UBSAN: Undefined behaviour in ./include/linux/ktime.h:42:14 > signed integer overflow: > -8427032702788048137 * 10 cannot be represented in type 'long long > int' > CPU: 3 PID: 11668 Comm: syz-executor0 Not tainted 4.19.18-514.55.6.9.x86_64+ > #1 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 > 04/01/2014 > Call Trace: > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0xca/0x13e lib/dump_stack.c:113 > ubsan_epilogue+0xe/0x81 lib/ubsan.c:159 > handle_overflow+0x193/0x1e2 lib/ubsan.c:190 > ktime_set include/linux/ktime.h:42 [inline] > timespec64_to_ktime include/linux/ktime.h:78 [inline] > do_io_getevents+0x307/0x390 fs/aio.c:2043 > __do_sys_io_getevents fs/aio.c:2080 [inline] > __se_sys_io_getevents fs/aio.c:2068 [inline] > __x64_sys_io_getevents+0x163/0x250 fs/aio.c:2068 > do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > RIP: 0033:0x462589 > Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 > 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 > 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 > RSP: 002b:7fde9b04ec58 EFLAGS: 0246 ORIG_RAX: 00d0 > RAX: ffda RBX: 0072bf00 RCX: 00462589 > RDX: 0006 RSI: RDI: > RBP: 0005 R08: 2100 R09: > R10: 2040 R11: 0246 R12: 7fde9b04f6bc > R13: 004bd1f0 R14: 006f6b60 R15: > > bond0 (unregistering): Released all slaves > > The timeout described in "man io_getevents" does not say whether it > can be negative or not, but as a waiting time, a negative number has > no meaning. So I add check to avoid this case. It's embarrassing that this bug is still present. See, for example, this discussion, started in 2015: https://lore.kernel.org/lkml/cact4y+bbxvylq6ltokrktnlthqlhcw-bmp3aqp3mjdavr9f...@mail.gmail.com/ I could swear it was brought up again since then, but I can't find records of that. > Signed-off-by: Tan Xiaojun > --- > fs/aio.c | 7 ++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/fs/aio.c b/fs/aio.c > index f4d..28e0fa6 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -2078,10 +2078,15 @@ static long do_io_getevents(aio_context_t ctx_id, > struct io_event __user *events, > struct timespec64 *ts) > { > - ktime_t until = ts ? timespec64_to_ktime(*ts) : KTIME_MAX; > + ktime_t until; > struct kioctx *ioctx = lookup_ioctx(ctx_id); > long ret = -EINVAL; > > + if (ts && !timespec64_valid(ts)) > + return -EINVAL; > + > + until = ts ? timespec64_to_ktime(*ts) : KTIME_MAX; > + > if (likely(ioctx)) { > if (likely(min_nr <= nr && min_nr >= 0)) > ret = read_events(ioctx, min_nr, nr, events, until); Looks good to me. Thanks for fixing this. Reviewed-by: Jeff Moyer
Re: [PATCH] blk-mq: fix the cmd_flag_name array
Jianchao Wang writes: > Swap REQ_NOWAIT and REQ_NOUNMAP and add REQ_HIPRI. > > Signed-off-by: Jianchao Wang > --- > block/blk-mq-debugfs.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c > index 90d6876..f812083 100644 > --- a/block/blk-mq-debugfs.c > +++ b/block/blk-mq-debugfs.c > @@ -308,8 +308,9 @@ static const char *const cmd_flag_name[] = { > CMD_FLAG_NAME(PREFLUSH), > CMD_FLAG_NAME(RAHEAD), > CMD_FLAG_NAME(BACKGROUND), > - CMD_FLAG_NAME(NOUNMAP), > CMD_FLAG_NAME(NOWAIT), > + CMD_FLAG_NAME(NOUNMAP), > + CMD_FLAG_NAME(HIPRI), > }; > #undef CMD_FLAG_NAME Acked-by: Jeff Moyer You might consider also adding a comment above the req_flag_bits enum noting that modifications also need to be propagated to cmd_flag_name.
Re: [PATCH 0/4] Allow persistent memory to be used like normal RAM
Keith Busch writes: >> Keith, you seem to be implying that there are platforms that won't >> support memory mode. Do you also have some insight into how customers >> want to use this, beyond my speculation? It's really frustrating to see >> patch sets like this go by without any real use cases provided. > > Right, most NFIT reporting platforms today don't have memory mode, and > the kernel currently only supports the persistent DAX mode with these. > This series adds another option for those platforms. All NFIT reporting platforms today are shipping NVDIMM-Ns, where it makes absolutely no sense to use them as regular DRAM. I don't think that's a good argument to make. > I think numactl as you mentioned is the first consideration for how > customers may make use. Dave or Dan might have other use cases in mind. Well, it sure looks like this took a lot of work, so I thought there were known use cases or users asking for this functionality. Cheers, Jeff
Re: [PATCH 0/4] Allow persistent memory to be used like normal RAM
Keith Busch writes: > On Thu, Jan 17, 2019 at 11:29:10AM -0500, Jeff Moyer wrote: >> Dave Hansen writes: >> > Persistent memory is cool. But, currently, you have to rewrite >> > your applications to use it. Wouldn't it be cool if you could >> > just have it show up in your system like normal RAM and get to >> > it like a slow blob of memory? Well... have I got the patch >> > series for you! >> >> So, isn't that what memory mode is for? >> >> https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/ >> >> Why do we need this code in the kernel? > > I don't think those are the same thing. The "memory mode" in the link > refers to platforms that sequester DRAM to side cache memory access, where > this series doesn't have that platform dependency nor hides faster DRAM. OK, so you are making two arguments, here. 1) platforms may not support memory mode, and 2) this series allows for performance differentiated memory (even though applications may not modified to make use of that...). With this patch set, an unmodified application would either use: 1) whatever memory it happened to get 2) only the faster dram (via numactl --membind=) 3) only the slower pmem (again, via numactl --membind1) 4) preferentially one or the other (numactl --preferred=) The other options are: - as mentioned above, memory mode, which uses DRAM as a cache for the slower persistent memory. Note that it isn't all or nothing--you can configure your system with both memory mode and appdirect. The limitation, of course, is that your platform has to support this. This seems like the obvious solution if you want to make use of the larger pmem capacity as regular volatile memory (and your platform supports it). But maybe there is some other limitation that motivated this work? - libmemkind or pmdk. These options typically* require application modifications, but allow those applications to actively decide which data lives in fast versus slow media. This seems like the obvious answer for applications that care about access latency. * you could override the system malloc, but some libraries/application stacks already do that, so it isn't a universal solution. Listing something like this in the headers of these patch series would considerably reduce the head-scratching for reviewers. Keith, you seem to be implying that there are platforms that won't support memory mode. Do you also have some insight into how customers want to use this, beyond my speculation? It's really frustrating to see patch sets like this go by without any real use cases provided. Cheers, Jeff
Re: [PATCH 0/4] Allow persistent memory to be used like normal RAM
Dave Hansen writes: > Persistent memory is cool. But, currently, you have to rewrite > your applications to use it. Wouldn't it be cool if you could > just have it show up in your system like normal RAM and get to > it like a slow blob of memory? Well... have I got the patch > series for you! So, isn't that what memory mode is for? https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/ Why do we need this code in the kernel? -Jeff
Re: [PATCH v3 0/2] acpi/nfit: Fix command-supported detection
Dan Williams writes: > Changes since v2 [1]: > * Don't allow ND_CMD_CALL to bypass dsm_mask restrictions (Jeff) > > [1]: https://lists.01.org/pipermail/linux-nvdimm/2019-January/019498.html > > --- > > One last resend to make sure all the last bits of thrash have settled. LGTM. Thanks! Jeff
Re: [PATCH v2 0/2] acpi/nfit: Fix command-supported detection
Dan Williams writes: > On Tue, Jan 15, 2019 at 6:16 AM Jeff Moyer wrote: >> >> Dan Williams writes: >> >> > Changes since v1 [1]: >> > * Include another patch make sure that function-number zero can be >> > safely used as an invalid function number (Jeff) >> > * Add a comment clarifying why zero is an invalid function number (Jeff) >> > * Pass nfit_mem to cmd_to_func() (Jeff) >> > * Collect a Tested-by from Sujith >> > [1]: https://lists.01.org/pipermail/linux-nvdimm/2019-January/019435.html >> >> For the series: >> >> Reviewed-by: Jeff Moyer >> >> Thanks, Dan! > > Thanks, although I just realized one more change. The ND_CMD_CALL case > should zero out command after the function translation, otherwise > userspace can call functions that the kernel is blocking in the > dsm_mask. > > Holler if this invalidates your "Reviewed-by". AAAHH :) > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > index 87e02f281e51..d7747aceb7ab 100644 > --- a/drivers/acpi/nfit/core.c > +++ b/drivers/acpi/nfit/core.c > @@ -463,6 +463,12 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor > *nd_desc, struct nvdimm *nvdimm, > func = cmd_to_func(nfit_mem, cmd, buf); > if (func < 0) > return func; > + /* > +* In the ND_CMD_CALL case we're now dependent on 'func' > +* being validated by the dimm's dsm_mask > +*/ > + if (cmd == ND_CMD_CALL) > + cmd = 0; > dimm_name = nvdimm_name(nvdimm); > cmd_name = nvdimm_cmd_name(cmd); > cmd_mask = nvdimm_cmd_mask(nvdimm); dsm_mask = nfit_mem->dsm_mask; desc = nd_cmd_dimm_desc(cmd); That sure doesn't look right. Now cmd_name and desc will be wrong. > @@ -477,8 +483,10 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor > *nd_desc, struct nvdimm *nvdimm, > cmd_name = nvdimm_bus_cmd_name(cmd); > cmd_mask = nd_desc->cmd_mask; > dsm_mask = cmd_mask; > - if (cmd == ND_CMD_CALL) > + if (cmd == ND_CMD_CALL) { > dsm_mask = nd_desc->bus_dsm_mask; > + cmd = 0; > + } > desc = nd_cmd_bus_desc(cmd); And again here. We could reorder the zeroing, or you could modify the check for a valid comand/function. Something like this? /* * Check for a valid command. For ND_CMD_CALL, we also * have to make sure that the DSM function is supported. */ if (cmd == ND_CMD_CALL && !test_bit(func, &dsm_mask)) return -ENOTTY; else if (!test_bit(cmd, &cmd_mask)) return -ENOTTY; Which way do you think is cleaner? Cheers, Jeff
Re: [PATCH v2 0/2] acpi/nfit: Fix command-supported detection
Dan Williams writes: > Changes since v1 [1]: > * Include another patch make sure that function-number zero can be > safely used as an invalid function number (Jeff) > * Add a comment clarifying why zero is an invalid function number (Jeff) > * Pass nfit_mem to cmd_to_func() (Jeff) > * Collect a Tested-by from Sujith > [1]: https://lists.01.org/pipermail/linux-nvdimm/2019-January/019435.html For the series: Reviewed-by: Jeff Moyer Thanks, Dan! > > --- > > Quote patch2 changelog: > > The _DSM function number validation only happens to succeed when the > generic Linux command number translation corresponds with a > DSM-family-specific function number. This breaks NVDIMM-N > implementations that correctly implement _LSR, _LSW, and _LSI, but do > not happen to publish support for DSM function numbers 4, 5, and 6. > > Recall that the support for _LS{I,R,W} family of methods results in the > DIMM being marked as supporting those command numbers at > acpi_nfit_register_dimms() time. The DSM function mask is only used for > ND_CMD_CALL support of non-NVDIMM_FAMILY_INTEL devices. > > --- > > Dan Williams (2): > acpi/nfit: Block function zero DSMs > acpi/nfit: Fix command-supported detection > > > drivers/acpi/nfit/core.c | 53 > ++ > 1 file changed, 39 insertions(+), 14 deletions(-)
Re: [PATCH] acpi/nfit: Fix command-supported detection
Dan Williams writes: > On Mon, Jan 14, 2019 at 8:43 AM Dan Williams wrote: >> On Mon, Jan 14, 2019 at 7:19 AM Jeff Moyer wrote: > [..] >> > > + >> > > + if (cmd == ND_CMD_CALL) { >> > > + int i; >> > > + >> > > + if (call_pkg && nfit_mem->family != call_pkg->nd_family) >> > > + return -ENOTTY; >> > > + >> > > + for (i = 0; i < ARRAY_SIZE(call_pkg->nd_reserved2); i++) >> > > + if (call_pkg->nd_reserved2[i]) >> > > + return -EINVAL; >> > > + return call_pkg->nd_command; >> > > + } >> > > + >> > > + /* Linux ND commands == NVDIMM_FAMILY_INTEL function numbers */ >> > > + if (nfit_mem->family == NVDIMM_FAMILY_INTEL) >> > > + return cmd; >> > > + return 0; >> > >> > Function zero? Is that really the right thing to return here? >> >> Yes, function zero is never set in n > > ...whoops fumble fingered "send" > > Function zero should never be set in nfit_mem->dsm_mask, although the > NVDIMM_FAMILY_MSFT mask violates this assumption. I'll fix that up. OK, I think I see how it all fits together now, thanks. It would be nice if you documented this magical 0 return somehow. Cheers, Jeff
Re: [PATCH] acpi/nfit: Fix command-supported detection
Dan Williams writes: > The _DSM function number validation only happens to succeed when the > generic Linux command number translation corresponds with a > DSM-family-specific function number. This breaks NVDIMM-N > implementations that correctly implement _LSR, _LSW, and _LSI, but do > not happen to publish support for DSM function numbers 4, 5, and 6. > > Recall that the support for _LS{I,R,W} family of methods results in the > DIMM being marked as supporting those command numbers at > acpi_nfit_register_dimms() time. The DSM function mask is only used for > ND_CMD_CALL support of non-NVDIMM_FAMILY_INTEL devices. > > Fixes: 31eca76ba2fc ("nfit, libnvdimm: limited/whitelisted dimm command...") > Cc: > Link: https://github.com/pmem/ndctl/issues/78 > Reported-by: Sujith Pandel > Signed-off-by: Dan Williams > --- > Sujith, this is a larger change than what you originally tested, but it > should behave the same. I wanted to consolidate all the code that > handles Linux command number to DIMM _DSM function number translation. > > If you have a chance to re-test with this it would be much appreciated. > > Thanks for the report! > > drivers/acpi/nfit/core.c | 43 +-- > 1 file changed, 29 insertions(+), 14 deletions(-) > > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > index 790691d9a982..d5d64e90ae71 100644 > --- a/drivers/acpi/nfit/core.c > +++ b/drivers/acpi/nfit/core.c > @@ -409,6 +409,29 @@ static bool payload_dumpable(struct nvdimm *nvdimm, > unsigned int func) > return true; > } > > +static int cmd_to_func(struct nvdimm *nvdimm, unsigned int cmd, > + struct nd_cmd_pkg *call_pkg) > +{ > + struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm); Minor nit: Seems like the function could take an nfit_mem parameter instead of an nvdimm. > + > + if (cmd == ND_CMD_CALL) { > + int i; > + > + if (call_pkg && nfit_mem->family != call_pkg->nd_family) > + return -ENOTTY; > + > + for (i = 0; i < ARRAY_SIZE(call_pkg->nd_reserved2); i++) > + if (call_pkg->nd_reserved2[i]) > + return -EINVAL; > + return call_pkg->nd_command; > + } > + > + /* Linux ND commands == NVDIMM_FAMILY_INTEL function numbers */ > + if (nfit_mem->family == NVDIMM_FAMILY_INTEL) > + return cmd; > + return 0; Function zero? Is that really the right thing to return here? Cheers, Jeff
Re: [PATCH] scsi: isci: initialize shost fully before calling scsi_add_host()
Logan Gunthorpe writes: > scsi_mq_setup_tags(), which is called by scsi_add_host(), calculates > the command size to allocate based on the prot_capabilities. In the > isci driver, scsi_host_set_prot() is called after scsi_add_host() > so the command size gets calculated to be smaller than it needs to be. > Eventually, scsi_mq_init_request() locates the 'prot_sdb' after the command > assuming it was sized correctly and a buffer overrun may occur. [...] > To prevent this, the calls to scsi_host_set_prot() are moved into > isci_host_alloc() before the call to scsi_add_host(). Out of caution, > also move the similar call to scsi_host_set_guard(). > > Fixes: 3d2d75254915 ("[SCSI] isci: T10 DIF support") > Link: > http://lkml.kernel.org/r/da851333-eadd-163a-8c78-e1f4ec5ec...@deltatee.com > Signed-off-by: Logan Gunthorpe > Cc: Intel SCU Linux support > Cc: Artur Paszkiewicz > Cc: "James E.J. Bottomley" > Cc: "Martin K. Petersen" > Cc: Christoph Hellwig > Cc: Jens Axboe > Cc: Jeff Moyer Nice job, and excellent commit message. We'll need a similar patch for lpfc. Reviewed-by: Jeff Moyer > --- > drivers/scsi/isci/init.c | 14 +++--- > 1 file changed, 7 insertions(+), 7 deletions(-) > > diff --git a/drivers/scsi/isci/init.c b/drivers/scsi/isci/init.c > index 68b90c4f79a3..1727d0c71b12 100644 > --- a/drivers/scsi/isci/init.c > +++ b/drivers/scsi/isci/init.c > @@ -576,6 +576,13 @@ static struct isci_host *isci_host_alloc(struct pci_dev > *pdev, int id) > shost->max_lun = ~0; > shost->max_cmd_len = MAX_COMMAND_SIZE; > > + /* turn on DIF support */ > + scsi_host_set_prot(shost, > +SHOST_DIF_TYPE1_PROTECTION | > +SHOST_DIF_TYPE2_PROTECTION | > +SHOST_DIF_TYPE3_PROTECTION); > + scsi_host_set_guard(shost, SHOST_DIX_GUARD_CRC); > + > err = scsi_add_host(shost, &pdev->dev); > if (err) > goto err_shost; > @@ -663,13 +670,6 @@ static int isci_pci_probe(struct pci_dev *pdev, const > struct pci_device_id *id) > goto err_host_alloc; > } > pci_info->hosts[i] = h; > - > - /* turn on DIF support */ > - scsi_host_set_prot(to_shost(h), > -SHOST_DIF_TYPE1_PROTECTION | > -SHOST_DIF_TYPE2_PROTECTION | > -SHOST_DIF_TYPE3_PROTECTION); > - scsi_host_set_guard(to_shost(h), SHOST_DIX_GUARD_CRC); > } > > err = isci_setup_interrupts(pdev);
Re: Regression in v5.0-rc1: Panic at boot
Hi, Logan, Logan Gunthorpe writes: > Hey, > > I found a regression in v5.0-rc1 this morning. My system panics on boot. > I've attached a log of the panic. > > I bisected to find the problematic commit is: > > Fixes: 9d037ad707ed ("block: remove req->timeout_list") > > But it makes no sense to me why this commit would cause a problem like > this. I've attached a bisect log. I've also tested v5.0-rc1 with this > commit reverted and that boots fine. > > The traceback seems to indicate the problem is on the bip_get_seed() > line in t10_pi_complete(). Which suggests that bio_integrity() is > returning NULL. Does your hardware actually support protection information? -Jeff
Re: [PATCH] aio: Convert ioctx_table to XArray
Jens Axboe writes: > On 12/11/18 11:02 AM, Jeff Moyer wrote: >> Matthew Wilcox writes: >> >>> On Tue, Dec 11, 2018 at 12:21:52PM -0500, Jeff Moyer wrote: >>>> I'm going to submit this version formally. If you're interested in >>>> converting the ioctx_table to xarray, you can do that separately from a >>>> security fix. I would include a performance analysis with that patch, >>>> though. The idea of using a radix tree for the ioctx table was >>>> discarded due to performance reasons--see commit db446a08c23d5 ("aio: >>>> convert the ioctx list to table lookup v3"). I suspect using the xarray >>>> will perform similarly. >>> >>> There's a big difference between Octavian's patch and mine. That patch >>> indexed into the radix tree by 'ctx_id' directly, which was pretty >>> much guaranteed to exhibit some close-to-worst-case behaviour from the >>> radix tree due to IDs being sparsely assigned. My patch uses the ring >>> ID which _we_ assigned, and so is nicely behaved, being usually a very >>> small integer. >> >> OK, good to know. I obviously didn't look too closely at the two. >> >>> What performance analysis would you find compelling? Octavian's original >>> fio script: >>> >>>> rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1 >>>> blocksize=1024; numjobs=512; thread; loops=100 >>>> >>>> on an EXT2 filesystem mounted on top of a ramdisk >>> >>> or something else? >> >> I think the most common use case is a small number of ioctx-s, so I'd >> like to see that use case not regress (that should be easy, right?). Bah, I meant a small number of threads doing submit/getevents. >> Kent, what were the tests you were using when doing this work? Jens, >> since you're doing performance work in this area now, are there any >> particular test cases you care about? > > I can give it a spin, ioctx lookup is in the fast path, and for "classic" > aio we do it twice for each IO... Thanks! -Jeff
Re: [PATCH] aio: Convert ioctx_table to XArray
Matthew Wilcox writes: > On Tue, Dec 11, 2018 at 12:21:52PM -0500, Jeff Moyer wrote: >> I'm going to submit this version formally. If you're interested in >> converting the ioctx_table to xarray, you can do that separately from a >> security fix. I would include a performance analysis with that patch, >> though. The idea of using a radix tree for the ioctx table was >> discarded due to performance reasons--see commit db446a08c23d5 ("aio: >> convert the ioctx list to table lookup v3"). I suspect using the xarray >> will perform similarly. > > There's a big difference between Octavian's patch and mine. That patch > indexed into the radix tree by 'ctx_id' directly, which was pretty > much guaranteed to exhibit some close-to-worst-case behaviour from the > radix tree due to IDs being sparsely assigned. My patch uses the ring > ID which _we_ assigned, and so is nicely behaved, being usually a very > small integer. OK, good to know. I obviously didn't look too closely at the two. > What performance analysis would you find compelling? Octavian's original > fio script: > >> rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1 >> blocksize=1024; numjobs=512; thread; loops=100 >> >> on an EXT2 filesystem mounted on top of a ramdisk > > or something else? I think the most common use case is a small number of ioctx-s, so I'd like to see that use case not regress (that should be easy, right?). Kent, what were the tests you were using when doing this work? Jens, since you're doing performance work in this area now, are there any particular test cases you care about? Cheers, Jeff
Re: [PATCH] aio: Convert ioctx_table to XArray
Jeff Moyer writes: > Jeff Moyer writes: > >> Matthew Wilcox writes: >> >>> This custom resizing array was vulnerable to a Spectre attack (speculating >>> off the end of an array to a user-controlled offset). The XArray is >>> not vulnerable to Spectre as it always masks its lookups to be within >>> the bounds of the array. >> >> I'm not a big fan of completely re-writing the code to fix this. Isn't >> the below patch sufficient? > > Too quick on the draw. Here's a patch that compiles. ;-) Hi, Matthew, I'm going to submit this version formally. If you're interested in converting the ioctx_table to xarray, you can do that separately from a security fix. I would include a performance analysis with that patch, though. The idea of using a radix tree for the ioctx table was discarded due to performance reasons--see commit db446a08c23d5 ("aio: convert the ioctx list to table lookup v3"). I suspect using the xarray will perform similarly. Cheers, Jeff > diff --git a/fs/aio.c b/fs/aio.c > index 97f983592925..aac9659381d2 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -45,6 +45,7 @@ > > #include > #include > +#include > > #include "internal.h" > > @@ -1038,6 +1039,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id) > if (!table || id >= table->nr) > goto out; > > + id = array_index_nospec(id, table->nr); > ctx = rcu_dereference(table->table[id]); > if (ctx && ctx->user_id == ctx_id) { > if (percpu_ref_tryget_live(&ctx->users)) > > -- > To unsubscribe, send a message with 'unsubscribe linux-aio' in > the body to majord...@kvack.org. For more info on Linux AIO, > see: http://www.kvack.org/aio/ > Don't email: mailto:"a...@kvack.org";>a...@kvack.org
Re: [PATCH] aio: Convert ioctx_table to XArray
Jeff Moyer writes: > Matthew Wilcox writes: > >> This custom resizing array was vulnerable to a Spectre attack (speculating >> off the end of an array to a user-controlled offset). The XArray is >> not vulnerable to Spectre as it always masks its lookups to be within >> the bounds of the array. > > I'm not a big fan of completely re-writing the code to fix this. Isn't > the below patch sufficient? Too quick on the draw. Here's a patch that compiles. ;-) Cheers, Jeff diff --git a/fs/aio.c b/fs/aio.c index 97f983592925..aac9659381d2 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -45,6 +45,7 @@ #include #include +#include #include "internal.h" @@ -1038,6 +1039,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id) if (!table || id >= table->nr) goto out; + id = array_index_nospec(id, table->nr); ctx = rcu_dereference(table->table[id]); if (ctx && ctx->user_id == ctx_id) { if (percpu_ref_tryget_live(&ctx->users))
Re: [PATCH] aio: Convert ioctx_table to XArray
Matthew Wilcox writes: > This custom resizing array was vulnerable to a Spectre attack (speculating > off the end of an array to a user-controlled offset). The XArray is > not vulnerable to Spectre as it always masks its lookups to be within > the bounds of the array. I'm not a big fan of completely re-writing the code to fix this. Isn't the below patch sufficient? -Jeff diff --git a/fs/aio.c b/fs/aio.c index 97f983592925..9402ae0b63d5 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1038,6 +1038,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id) if (!table || id >= table->nr) goto out; + id = array_index_nospec(index, table->nr); ctx = rcu_dereference(table->table[id]); if (ctx && ctx->user_id == ctx_id) { if (percpu_ref_tryget_live(&ctx->users))
Re: Very slow SYS_io_destroy()
Alex Richman writes: > Hi, > > I'm seeing some weirdness with AIO, specifically SYS_io_destroy() is > taking upwards of ~10 microseconds (~100 miliseconds) per call. > How long is that call expected to take? I can see from the source Well, it waits for an RCU grace period. I would not expect that to take 100ms, though. What kernel version is this? -Jeff
Re: [PATCH v6 4/5] fs: blkdev set bio prio from kiocb prio
adam.manzana...@wdc.com writes: > From: Adam Manzanares > > Now that kiocb has an ioprio field copy this over to the bio when it is > created from the kiocb. > > Signed-off-by: Adam Manzanares Reviewed-by: Jeff Moyer Thanks! Jeff > --- > fs/block_dev.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 7ec920e27065..11ba99e79d2a 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -216,6 +216,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct > iov_iter *iter, > bio.bi_write_hint = iocb->ki_hint; > bio.bi_private = current; > bio.bi_end_io = blkdev_bio_end_io_simple; > + bio.bi_ioprio = iocb->ki_ioprio; > > ret = bio_iov_iter_get_pages(&bio, iter); > if (unlikely(ret)) > @@ -355,6 +356,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter > *iter, int nr_pages) > bio->bi_write_hint = iocb->ki_hint; > bio->bi_private = dio; > bio->bi_end_io = blkdev_bio_end_io; > + bio->bi_ioprio = iocb->ki_ioprio; > > ret = bio_iov_iter_get_pages(bio, iter); > if (unlikely(ret)) {
Re: [PATCH v5 4/5] fs: blkdev set bio prio from kiocb prio
adam.manzana...@wdc.com writes: > From: Adam Manzanares > > Now that kiocb has an ioprio field copy this over to the bio when it is > created from the kiocb. > > Signed-off-by: Adam Manzanares > --- > fs/block_dev.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 7ec920e27065..da1e94d2bb75 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -355,6 +355,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter > *iter, int nr_pages) > bio->bi_write_hint = iocb->ki_hint; > bio->bi_private = dio; > bio->bi_end_io = blkdev_bio_end_io; > + bio->bi_ioprio = iocb->ki_ioprio; > > ret = bio_iov_iter_get_pages(bio, iter); > if (unlikely(ret)) { Forgot to mention, you should also convert __blkdev_direct_IO_simple. -Jeff
Re: [PATCH v5 3/5] fs: Add aio iopriority support
adam.manzana...@wdc.com writes: > From: Adam Manzanares > > This is the per-I/O equivalent of the ioprio_set system call. > > When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the > newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field. > > We set the blkdev bio iopriority unconditionally, so we need to guarantee the > kiocb is initialized properly. Added changes to the loopback driver and > init_sync_kiocb to achieve this. > > This patch depends on block: add ioprio_check_cap function. > > Signed-off-by: Adam Manzanares Reviewed-by: Jeff Moyer > --- > drivers/block/loop.c | 3 +++ > fs/aio.c | 16 > include/linux/fs.h | 3 +++ > include/uapi/linux/aio_abi.h | 1 + > 4 files changed, 23 insertions(+) > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index 5d4e31655d96..dd98dfd97f5e 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -76,6 +76,8 @@ > #include > #include > #include > +#include > + > #include "loop.h" > > #include > @@ -559,6 +561,7 @@ static int lo_rw_aio(struct loop_device *lo, struct > loop_cmd *cmd, > cmd->iocb.ki_filp = file; > cmd->iocb.ki_complete = lo_rw_aio_complete; > cmd->iocb.ki_flags = IOCB_DIRECT; > + cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); > if (cmd->css) > kthread_associate_blkcg(cmd->css); > > diff --git a/fs/aio.c b/fs/aio.c > index f3eae5d5771b..44b4572be524 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1451,6 +1451,22 @@ static int aio_prep_rw(struct kiocb *req, struct iocb > *iocb) > if (iocb->aio_flags & IOCB_FLAG_RESFD) > req->ki_flags |= IOCB_EVENTFD; > req->ki_hint = file_write_hint(req->ki_filp); > + if (iocb->aio_flags & IOCB_FLAG_IOPRIO) { > + /* > + * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then > + * aio_reqprio is interpreted as an I/O scheduling > + * class and priority. > + */ > + ret = ioprio_check_cap(iocb->aio_reqprio); > + if (ret) { > + pr_debug("aio ioprio check cap error\n"); > + return -EINVAL; > + } > + > + req->ki_ioprio = iocb->aio_reqprio; > + } else > + req->ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); > + > ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags); > if (unlikely(ret)) > fput(req->ki_filp); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 50de40dbbb85..73b749ed3ea1 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -36,6 +36,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -302,6 +303,7 @@ struct kiocb { > void*private; > int ki_flags; > u16 ki_hint; > + u16 ki_ioprio; /* See linux/ioprio.h */ > } __randomize_layout; > > static inline bool is_sync_kiocb(struct kiocb *kiocb) > @@ -1942,6 +1944,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, > struct file *filp) > .ki_filp = filp, > .ki_flags = iocb_flags(filp), > .ki_hint = ki_hint_validate(file_write_hint(filp)), > + .ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0), > }; > } > > diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h > index 2c0a3415beee..d4e768d55d14 100644 > --- a/include/uapi/linux/aio_abi.h > +++ b/include/uapi/linux/aio_abi.h > @@ -55,6 +55,7 @@ enum { > * is valid. > */ > #define IOCB_FLAG_RESFD (1 << 0) > +#define IOCB_FLAG_IOPRIO (1 << 1) > > /* read() from /dev/aio returns these structures. */ > struct io_event {
Re: [PATCH v5 5/5] fs: iomap dio set bio prio from kiocb prio
adam.manzana...@wdc.com writes: > From: Adam Manzanares > > Now that kiocb has an ioprio field copy this over to the bio when it is > created from the kiocb during direct IO. > > Signed-off-by: Adam Manzanares Reviewed-by: Jeff Moyer > --- > fs/iomap.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/iomap.c b/fs/iomap.c > index afd163586aa0..65aae194aeca 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -919,6 +919,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t > length, > bio->bi_iter.bi_sector = > (iomap->addr + pos - iomap->offset) >> 9; > bio->bi_write_hint = dio->iocb->ki_hint; > + bio->bi_ioprio = dio->iocb->ki_ioprio; > bio->bi_private = dio; > bio->bi_end_io = iomap_dio_bio_end_io;
Re: [PATCH v5 4/5] fs: blkdev set bio prio from kiocb prio
adam.manzana...@wdc.com writes: > From: Adam Manzanares > > Now that kiocb has an ioprio field copy this over to the bio when it is > created from the kiocb. > > Signed-off-by: Adam Manzanares Reviewed-by: Jeff Moyer > --- > fs/block_dev.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 7ec920e27065..da1e94d2bb75 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -355,6 +355,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter > *iter, int nr_pages) > bio->bi_write_hint = iocb->ki_hint; > bio->bi_private = dio; > bio->bi_end_io = blkdev_bio_end_io; > + bio->bi_ioprio = iocb->ki_ioprio; > > ret = bio_iov_iter_get_pages(bio, iter); > if (unlikely(ret)) {
Re: [PATCH v5 1/5] block: add ioprio_check_cap function
adam.manzana...@wdc.com writes: > From: Adam Manzanares > > Aio per command iopriority support introduces a second interface between > userland and the kernel capable of passing iopriority. The aio interface also > needs the ability to verify that the submitting context has sufficient > priviledges to submit IOPRIO_RT commands. This patch creates the > ioprio_check_cap function to be used by the ioprio_set system call and also by > the aio interface. > > Signed-off-by: Adam Manzanares > Reviewed-by: Christoph Hellwig Reviewed-by: Jeff Moyer > --- > block/ioprio.c | 22 -- > include/linux/ioprio.h | 2 ++ > 2 files changed, 18 insertions(+), 6 deletions(-) > > diff --git a/block/ioprio.c b/block/ioprio.c > index 6f5d0b6625e3..f9821080c92c 100644 > --- a/block/ioprio.c > +++ b/block/ioprio.c > @@ -61,15 +61,10 @@ int set_task_ioprio(struct task_struct *task, int ioprio) > } > EXPORT_SYMBOL_GPL(set_task_ioprio); > > -SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio) > +int ioprio_check_cap(int ioprio) > { > int class = IOPRIO_PRIO_CLASS(ioprio); > int data = IOPRIO_PRIO_DATA(ioprio); > - struct task_struct *p, *g; > - struct user_struct *user; > - struct pid *pgrp; > - kuid_t uid; > - int ret; > > switch (class) { > case IOPRIO_CLASS_RT: > @@ -92,6 +87,21 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, > ioprio) > return -EINVAL; > } > > + return 0; > +} > + > +SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio) > +{ > + struct task_struct *p, *g; > + struct user_struct *user; > + struct pid *pgrp; > + kuid_t uid; > + int ret; > + > + ret = ioprio_check_cap(ioprio); > + if (ret) > + return ret; > + > ret = -ESRCH; > rcu_read_lock(); > switch (which) { > diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h > index 627efac73e6d..4a28cec49ec3 100644 > --- a/include/linux/ioprio.h > +++ b/include/linux/ioprio.h > @@ -77,4 +77,6 @@ extern int ioprio_best(unsigned short aprio, unsigned short > bprio); > > extern int set_task_ioprio(struct task_struct *task, int ioprio); > > +extern int ioprio_check_cap(int ioprio); > + > #endif
Re: [PATCH v5 0/5] AIO add per-command iopriority
Hi, Adam, adam.manzana...@wdc.com writes: > From: Adam Manzanares > > This is the per-I/O equivalent of the ioprio_set system call. > See the following link for performance implications on a SATA HDD: > https://lkml.org/lkml/2016/12/6/495 > > First patch factors ioprio_check_cap function out of ioprio_set system call to > also be used by the aio ioprio interface. > > Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat. > > Third patch passes ioprio hint from aio iocb to kiocb and inititalizes kiocb > ioprio value appropriately when it is not explicitly set. > > Fourth patch enables the feature for blkdev. > > Fifth patch enables the feature for iomap direct IO > > Note: this work is based on top of linux-vfs/for-next I'll cook up a libaio test case. Can you put together a man-pages update for this? Thanks! Jeff
Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
Dan Williams writes: > On Mon, May 7, 2018 at 12:08 PM, Jeff Moyer wrote: >> Dan Williams writes: >> >>> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox wrote: >>>> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote: >>>>> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as >>>>> DEVICE zone, which is a virtual zone and both its start and end of pfn >>>>> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses >>>>> corresponding drivers, which locate at \drivers\nvdimm\ and >>>>> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with >>>>> memory hot plug implementation. >>>> >>>> You probably want to let linux-nvdimm know about this patch set. >>>> Adding to the cc. >>> >>> Yes, thanks for that! >>> >>>> Also, I only received patch 0 and 4. What happened >>>> to 1-3,5 and 6? >>>> >>>>> With current kernel, many mm’s classical features like the buddy >>>>> system, swap mechanism and page cache couldn’t be supported to NVDIMM. >>>>> What we are doing is to expand kernel mm’s capacity to make it to handle >>>>> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM >>>>> separately, that means mm can only put the critical pages to NVDIMM >> >> Please define "critical pages." >> >>>>> zone, here we created a new zone type as NVM zone. That is to say for >>>>> traditional(or normal) pages which would be stored at DRAM scope like >>>>> Normal, DMA32 and DMA zones. But for the critical pages, which we hope >>>>> them could be recovered from power fail or system crash, we make them >>>>> to be persistent by storing them to NVM zone. >> >> [...] >> >>> I think adding yet one more mm-zone is the wrong direction. Instead, >>> what we have been considering is a mechanism to allow a device-dax >>> instance to be given back to the kernel as a distinct numa node >>> managed by the VM. It seems it times to dust off those patches. >> >> What's the use case? > > Use NVDIMMs as System-RAM given their potentially higher capacity than > DDR. The expectation in that case is that data is forfeit (not > persisted) after a crash. Any persistent use case would need to go > through the pmem driver, filesystem-dax or device-dax. OK, but that sounds different from what was being proposed, here. I'll quote from above: >>>>> But for the critical pages, which we hope them could be recovered ^ >>>>> from power fail or system crash, we make them to be persistent by ^^^ >>>>> storing them to NVM zone. Hence my confusion. Cheers, Jeff
Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
Dan Williams writes: > On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox wrote: >> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote: >>> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as >>> DEVICE zone, which is a virtual zone and both its start and end of pfn >>> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses >>> corresponding drivers, which locate at \drivers\nvdimm\ and >>> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with >>> memory hot plug implementation. >> >> You probably want to let linux-nvdimm know about this patch set. >> Adding to the cc. > > Yes, thanks for that! > >> Also, I only received patch 0 and 4. What happened >> to 1-3,5 and 6? >> >>> With current kernel, many mm’s classical features like the buddy >>> system, swap mechanism and page cache couldn’t be supported to NVDIMM. >>> What we are doing is to expand kernel mm’s capacity to make it to handle >>> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM >>> separately, that means mm can only put the critical pages to NVDIMM Please define "critical pages." >>> zone, here we created a new zone type as NVM zone. That is to say for >>> traditional(or normal) pages which would be stored at DRAM scope like >>> Normal, DMA32 and DMA zones. But for the critical pages, which we hope >>> them could be recovered from power fail or system crash, we make them >>> to be persistent by storing them to NVM zone. [...] > I think adding yet one more mm-zone is the wrong direction. Instead, > what we have been considering is a mechanism to allow a device-dax > instance to be given back to the kernel as a distinct numa node > managed by the VM. It seems it times to dust off those patches. What's the use case? The above patch description seems to indicate an intent to recover contents after a power loss. Without seeing the whole series, I'm not sure how that's accomplished in a safe or meaningful way. Huaisheng, could you provide a bit more background? Thanks! Jeff
Re: [PATCH v2] fs: Add aio iopriority support for block_dev
Hi, Adam, adam.manzana...@wdc.com writes: > From: Adam Manzanares > > This is the per-I/O equivalent of the ioprio_set system call. > > When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the > newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field. > > When a bio is created for an aio request by the block dev we set the priority > value of the bio to the user supplied value. > > See the following link for performance implications on a SATA HDD: > https://lkml.org/lkml/2016/12/6/495 > > Given that WRR support for NVME devices has patches floating around and it was > discussed at LSFMM, we may soon have a lower latency storage device that can > make use of iopriorities. A per command iopriority interface seems timely > given these developments. > > If we want to avoid bloating struct kiocb, I suggest we turn the private > field > into a union of the private and ki_ioprio field. It seems like the users of > the private field all use it at a point where we can yank the priority from > the kiocb before the private field is used. Comments and suggestions welcome. The ioprio_set system call requires CAP_SYS_ADMIN for setting IOPRIO_CLASS_RT. I think we need similar checks here. -Jeff > > v2: merge patches > use IOCB_FLAG_IOPRIO > validate intended use with IOCB_IOPRIO > add linux-api and linux-block to cc > > Signed-off-by: Adam Manzanares > --- > fs/aio.c | 10 ++ > fs/block_dev.c | 2 ++ > include/linux/fs.h | 2 ++ > include/uapi/linux/aio_abi.h | 1 + > 4 files changed, 15 insertions(+) > > diff --git a/fs/aio.c b/fs/aio.c > index 88d7927ffbc6..f36636d8ff2c 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1597,6 +1597,16 @@ static int io_submit_one(struct kioctx *ctx, struct > iocb __user *user_iocb, > req->common.ki_flags |= IOCB_EVENTFD; > } > > + if (iocb->aio_flags & IOCB_FLAG_IOPRIO) { > + /* > + * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then > + * aio_reqprio is interpreted as an I/O scheduling > + * class and priority. > + */ > + req->common.ki_ioprio = iocb->aio_reqprio; > + req->common.ki_flags |= IOCB_IOPRIO; > + } > + > ret = kiocb_set_rw_flags(&req->common, iocb->aio_rw_flags); > if (unlikely(ret)) { > pr_debug("EINVAL: aio_rw_flags\n"); > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 7ec920e27065..970bef79caa6 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -355,6 +355,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter > *iter, int nr_pages) > bio->bi_write_hint = iocb->ki_hint; > bio->bi_private = dio; > bio->bi_end_io = blkdev_bio_end_io; > + if (iocb->ki_flags & IOCB_IOPRIO) > + bio->bi_ioprio = iocb->ki_ioprio; > > ret = bio_iov_iter_get_pages(bio, iter); > if (unlikely(ret)) { > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 760d8da1b6c7..ab63ce720305 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -292,6 +292,7 @@ enum rw_hint { > #define IOCB_SYNC(1 << 5) > #define IOCB_WRITE (1 << 6) > #define IOCB_NOWAIT (1 << 7) > +#define IOCB_IOPRIO (1 << 8) > > struct kiocb { > struct file *ki_filp; > @@ -300,6 +301,7 @@ struct kiocb { > void*private; > int ki_flags; > enum rw_hintki_hint; > + u16 ki_ioprio; /* See linux/ioprio.h */ > } __randomize_layout; > > static inline bool is_sync_kiocb(struct kiocb *kiocb) > diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h > index a04adbc70ddf..d4593a6062ef 100644 > --- a/include/uapi/linux/aio_abi.h > +++ b/include/uapi/linux/aio_abi.h > @@ -54,6 +54,7 @@ enum { > * is valid. > */ > #define IOCB_FLAG_RESFD (1 << 0) > +#define IOCB_FLAG_IOPRIO (1 << 1) > > /* read() from /dev/aio returns these structures. */ > struct io_event {
Re: [PATCH] pmem: fix badblocks population for raw mode
Toshi Kani writes: > pmem_attach_disk() calls nvdimm_badblocks_populate() with resource > range uninitialized in the case of raw mode. This leads the pmem > driver to hit MCE despite of ARS reporting the range bad. > > Initialize 'bb_res' for raw mode. > > Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface to use > struct dev_pagemap") > Signed-off-by: Toshi Kani > Cc: Christoph Hellwig > Cc: Dan Williams > Cc: Reviewed-by: Jeff Moyer > --- > drivers/nvdimm/pmem.c |4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index 9d714926ecf5..2d7875209bce 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -367,9 +367,11 @@ static int pmem_attach_disk(struct device *dev, > addr = devm_memremap_pages(dev, &pmem->pgmap); > pmem->pfn_flags |= PFN_MAP; > memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res)); > - } else > + } else { > addr = devm_memremap(dev, pmem->phys_addr, > pmem->size, ARCH_MEMREMAP_PMEM); > + memcpy(&bb_res, res, sizeof(bb_res)); > + } > > /* >* At release time the queue must be frozen before > ___ > Linux-nvdimm mailing list > linux-nvd...@lists.01.org > https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [Qemu-devel] [RFC v2 1/2] virtio: add pmem driver
Pankaj Gupta writes: >> Ideally, qemu (seabios?) would advertise a platform capabilities >> sub-table that doesn't fill in the flush bits. > > Could you please elaborate on this, how its related to disabling > MAP_SYNC? We are not doing entire nvdimm device emulation. My mistake. If you're not providing an NFIT, then you can ignore this comment. I'll have a closer look at your patches next week. -Jeff
Re: [RFC v2 1/2] virtio: add pmem driver
Dan Williams writes: > [ adding Jeff directly since he has also been looking at > infrastructure to track when MAP_SYNC should be disabled ] > > On Wed, Apr 25, 2018 at 7:21 AM, Dan Williams > wrote: >> On Wed, Apr 25, 2018 at 4:24 AM, Pankaj Gupta wrote: >>> This patch adds virtio-pmem driver for KVM >>> guest. >> >> Minor nit, please expand your changelog line wrapping to 72 columns. >> >>> >>> Guest reads the persistent memory range >>> information from Qemu over VIRTIO and registers >>> it on nvdimm_bus. It also creates a nd_region >>> object with the persistent memory range >>> information so that existing 'nvdimm/pmem' >>> driver can reserve this into system memory map. >>> This way 'virtio-pmem' driver uses existing >>> functionality of pmem driver to register persistent >>> memory compatible for DAX capable filesystems. >> >> We need some additional enabling to disable MAP_SYNC for this enable to disable... I like it! ;-) >> configuration. In other words, if fsync() is required then we must >> disable the MAP_SYNC optimization. I think this should be a struct >> dax_device property looked up at mmap time in each MAP_SYNC capable >> ->mmap() file operation implementation. Ideally, qemu (seabios?) would advertise a platform capabilities sub-table that doesn't fill in the flush bits. -Jeff
Re: io_pgetevents & aio fsync V2
Christoph Hellwig writes: > On Fri, Apr 06, 2018 at 04:16:30AM +0100, Al Viro wrote: >> BTW, this is only tangentially related, but... does *anything* call >> io_submit() for huge amounts of iocb? I don't know. If an application did that, as many I/Os as could fit into the ring buffer would be submitted, and that's what gets returned from the system call (the number of submitted iocbs). >> Check in do_io_submit() is insane - "no more than MAX_LONG total of >> _pointers_". Compat variant goes for "no more than a page worth of >> pointers" and there's a hard limit in ioctx_alloc() - we can't ever >> get more than 8M slots in ring buffer... > > Logical upper bound for io_submit is nr_events passed to io_setup(), > which is bound by aio_max_nr. Except that we never actually check > against nr_events (or max_reqs as it is known in kernel) in io_submit. > Sigh.. io_submit_one calls aio_get_req which calls get_reqs_available, which is what does the checking for an available ring buffer entry. -Jeff
Re: [PATCH 08/36] aio: implement io_pgetevents
Christoph Hellwig writes: > On Tue, Mar 20, 2018 at 11:30:29AM -0400, Jeff Moyer wrote: >> > In this commit: >> > >> > http://git.infradead.org/users/hch/libaio.git/commitdiff/49f77d595210393ce7b125cb00233cf737402f56 >> >> BTW, the man pages are shipped in the man-pages package, now. >> Christoph, I can forward the change to Michael after this series goes >> in. Just let me know what's easiest for you. > > I'd be more than happy to send patch to Michael if we get these > patches merged finally :) hahaha, least of your problems, I know. :)
Re: [PATCH 08/36] aio: implement io_pgetevents
Christoph Hellwig writes: > On Mon, Mar 19, 2018 at 07:12:41PM -0700, Darrick J. Wong wrote: >> > Note that unlike many other signal related calls we do not pass a sigmask >> > size, as that would get us to 7 arguments, which aren't easily supported >> > by the syscall infrastructure. It seems a lot less painful to just add a >> > new syscall variant in the unlikely case we're going to increase the >> > sigset size. >> >> I'm assuming there's a proposed manpage update for this somewhere? :) > > In this commit: > > http://git.infradead.org/users/hch/libaio.git/commitdiff/49f77d595210393ce7b125cb00233cf737402f56 BTW, the man pages are shipped in the man-pages package, now. Christoph, I can forward the change to Michael after this series goes in. Just let me know what's easiest for you. Cheers, Jeff
Re: [PATCH 14/36] aio: implement IOCB_CMD_POLL
Christoph Hellwig writes: > Simple one-shot poll through the io_submit() interface. To poll for > a file descriptor the application should submit an iocb of type > IOCB_CMD_POLL. It will poll the fd for the events specified in the > the first 32 bits of the aio_buf field of the iocb. > > Unlike poll or epoll without EPOLLONESHOT this interface always works > in one shot mode, that is once the iocb is completed, it will have to be > resubmitted. > > Signed-off-by: Christoph Hellwig Also acked this one in the last posting. Acked-by: Jeff Moyer > --- > fs/aio.c | 102 > +++ > include/uapi/linux/aio_abi.h | 6 +-- > 2 files changed, 104 insertions(+), 4 deletions(-) > > diff --git a/fs/aio.c b/fs/aio.c > index da87cbf7c67a..0bafc4975d51 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -5,6 +5,7 @@ > * Implements an efficient asynchronous io interface. > * > * Copyright 2000, 2001, 2002 Red Hat, Inc. All Rights Reserved. > + * Copyright 2018 Christoph Hellwig. > * > * See ../COPYING for licensing terms. > */ > @@ -156,9 +157,17 @@ struct kioctx { > unsignedid; > }; > > +struct poll_iocb { > + struct file *file; > + __poll_tevents; > + struct wait_queue_head *head; > + struct wait_queue_entry wait; > +}; > + > struct aio_kiocb { > union { > struct kiocbrw; > + struct poll_iocbpoll; > }; > > struct kioctx *ki_ctx; > @@ -1565,6 +1574,96 @@ static ssize_t aio_write(struct kiocb *req, struct > iocb *iocb, bool vectored, > return ret; > } > > +static void __aio_complete_poll(struct poll_iocb *req, __poll_t mask) > +{ > + fput(req->file); > + aio_complete(container_of(req, struct aio_kiocb, poll), > + mangle_poll(mask), 0); > +} > + > +static void aio_complete_poll(struct poll_iocb *req, __poll_t mask) > +{ > + struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll); > + > + if (!(iocb->flags & AIO_IOCB_CANCELLED)) > + __aio_complete_poll(req, mask); > +} > + > +static int aio_poll_cancel(struct kiocb *rw) > +{ > + struct aio_kiocb *iocb = container_of(rw, struct aio_kiocb, rw); > + > + remove_wait_queue(iocb->poll.head, &iocb->poll.wait); > + __aio_complete_poll(&iocb->poll, 0); /* no events to report */ > + return 0; > +} > + > +static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int > sync, > + void *key) > +{ > + struct poll_iocb *req = container_of(wait, struct poll_iocb, wait); > + struct file *file = req->file; > + __poll_t mask = key_to_poll(key); > + > + assert_spin_locked(&req->head->lock); > + > + /* for instances that support it check for an event match first: */ > + if (mask && !(mask & req->events)) > + return 0; > + > + mask = vfs_poll_mask(file, req->events); > + if (!mask) > + return 0; > + > + __remove_wait_queue(req->head, &req->wait); > + aio_complete_poll(req, mask); > + return 1; > +} > + > +static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb) > +{ > + struct poll_iocb *req = &aiocb->poll; > + unsigned long flags; > + __poll_t mask; > + > + /* reject any unknown events outside the normal event mask. */ > + if ((u16)iocb->aio_buf != iocb->aio_buf) > + return -EINVAL; > + /* reject fields that are not defined for poll */ > + if (iocb->aio_offset || iocb->aio_nbytes || iocb->aio_rw_flags) > + return -EINVAL; > + > + req->events = demangle_poll(iocb->aio_buf) | POLLERR | POLLHUP; > + req->file = fget(iocb->aio_fildes); > + if (unlikely(!req->file)) > + return -EBADF; > + > + req->head = vfs_get_poll_head(req->file, req->events); > + if (!req->head) { > + fput(req->file); > + return -EINVAL; /* same as no support for IOCB_CMD_POLL */ > + } > + if (IS_ERR(req->head)) { > + mask = PTR_TO_POLL(req->head); > + goto done; > + } > + > + init_waitqueue_func_entry(&req->wait, aio_poll_wake); > + > + spin_lock_irqsave(&req->head->lock, flags); > + mask = vfs_poll_mask(req->file, req->events); > + if (!mask) { > + __kiocb_set_cancel_fn(aiocb, aio_poll_cancel, >
Re: [PATCH 08/36] aio: implement io_pgetevents
Christoph Hellwig writes: > This is the io_getevents equivalent of ppoll/pselect and allows to > properly mix signals and aio completions (especially with IOCB_CMD_POLL) > and atomically executes the following sequence: > > sigset_t origmask; > > pthread_sigmask(SIG_SETMASK, &sigmask, &origmask); > ret = io_getevents(ctx, min_nr, nr, events, timeout); > pthread_sigmask(SIG_SETMASK, &origmask, NULL); > > Note that unlike many other signal related calls we do not pass a sigmask > size, as that would get us to 7 arguments, which aren't easily supported > by the syscall infrastructure. It seems a lot less painful to just add a > new syscall variant in the unlikely case we're going to increase the > sigset size. > > Signed-off-by: Christoph Hellwig I acked this in the last set, so... Acked-by: Jeff Moyer > --- > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > fs/aio.c | 114 > ++--- > include/linux/compat.h | 7 ++ > include/linux/syscalls.h | 6 ++ > include/uapi/asm-generic/unistd.h | 4 +- > include/uapi/linux/aio_abi.h | 6 ++ > kernel/sys_ni.c| 2 + > 8 files changed, 130 insertions(+), 11 deletions(-) > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl > b/arch/x86/entry/syscalls/syscall_32.tbl > index 448ac2161112..5997c3e9ac3e 100644 > --- a/arch/x86/entry/syscalls/syscall_32.tbl > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > @@ -391,3 +391,4 @@ > 382 i386pkey_free sys_pkey_free > 383 i386statx sys_statx > 384 i386arch_prctl sys_arch_prctl > compat_sys_arch_prctl > +385 i386io_pgetevents sys_io_pgetevents > compat_sys_io_pgetevents > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl > b/arch/x86/entry/syscalls/syscall_64.tbl > index 5aef183e2f85..e995cd2b4e65 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -339,6 +339,7 @@ > 330 common pkey_alloc sys_pkey_alloc > 331 common pkey_free sys_pkey_free > 332 common statx sys_statx > +333 common io_pgetevents sys_io_pgetevents > > # > # x32-specific system call numbers start at 512 to avoid cache impact > diff --git a/fs/aio.c b/fs/aio.c > index 9d7d6e4cde87..da87cbf7c67a 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1291,10 +1291,6 @@ static long read_events(struct kioctx *ctx, long > min_nr, long nr, > wait_event_interruptible_hrtimeout(ctx->wait, > aio_read_events(ctx, min_nr, nr, event, &ret), > until); > - > - if (!ret && signal_pending(current)) > - ret = -EINTR; > - > return ret; > } > > @@ -1874,13 +1870,60 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id, > struct timespec __user *, timeout) > { > struct timespec64 ts; > + int ret; > + > + if (timeout && unlikely(get_timespec64(&ts, timeout))) > + return -EFAULT; > + > + ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); > + if (!ret && signal_pending(current)) > + ret = -EINTR; > + return ret; > +} > + > +SYSCALL_DEFINE6(io_pgetevents, > + aio_context_t, ctx_id, > + long, min_nr, > + long, nr, > + struct io_event __user *, events, > + struct timespec __user *, timeout, > + const struct __aio_sigset __user *, usig) > +{ > + struct __aio_sigset ksig = { NULL, }; > + sigset_tksigmask, sigsaved; > + struct timespec64 ts; > + int ret; > + > + if (timeout && unlikely(get_timespec64(&ts, timeout))) > + return -EFAULT; > > - if (timeout) { > - if (unlikely(get_timespec64(&ts, timeout))) > + if (usig && copy_from_user(&ksig, usig, sizeof(ksig))) > + return -EFAULT; > + > + if (ksig.sigmask) { > + if (ksig.sigsetsize != sizeof(sigset_t)) > + return -EINVAL; > + if (copy_from_user(&ksigmask, ksig.sigmask, sizeof(ksigmask))) > return -EFAULT; > + sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP)); > + sigprocmask(SIG_SETMASK, &ks
Re: [PATCH v2] libnvdimm: re-enable deep flush for pmem devices
Dan Williams writes: > On Mon, Feb 12, 2018 at 2:53 PM, Jeff Moyer wrote: >> Dave Jiang writes: >> >>> Re-enable deep flush so that users always have a way to be sure that a write >>> does make it all the way out to the NVDIMM. The PMEM driver writes always >>> make it "all the way to the NVDIMM", and it relies on the ADR mechanism to >>> flush the write buffers on power failure. Deep flush is there to explicitly >>> flush those write buffers to protect against (rare) ADR failure. >>> This change prevents a regression in deep flush behavior so that >>> applications >>> can continue to depend on fsync() as a mechanism to trigger deep flush in >>> the >>> filesystem-dax case. >> >> That's still very confusing text. Specifically, the part where you say >> that pmem driver writes always make it to the DIMM. I think the >> changelog could start with "Deep flush is there to explicitly flush >> write buffers" Anyway, the fix looks right to me. > > I ended up changing the commit message to this, let me know if it reads > better: Thanks. It's still unclear to me what the text, "The PMEM driver writes always arrive at the NVDIMM" means. However, it's good enough. Thanks! Jeff > > libnvdimm: re-enable deep flush for pmem devices via fsync() > > Re-enable deep flush so that users always have a way to be sure that a > write makes it all the way out to media. The PMEM driver writes always > arrive at the NVDIMM, and it relies on the ADR (Asynchronous DRAM > Refresh) mechanism to flush the write buffers on power failure. Deep > flush is there to explicitly flush those write buffers to protect > against (rare) ADR failure. This change prevents a regression in deep > flush behavior so that applications can continue to depend on fsync() as > a mechanism to trigger deep flush in the filesystem-DAX case. > > Fixes: 06e8ccdab15f4 ("acpi: nfit: Add support for detect platform > CPU cache...") > Signed-off-by: Dave Jiang > Signed-off-by: Dan Williams
Re: [PATCH v2] libnvdimm: re-enable deep flush for pmem devices
Dave Jiang writes: > Re-enable deep flush so that users always have a way to be sure that a write > does make it all the way out to the NVDIMM. The PMEM driver writes always > make it "all the way to the NVDIMM", and it relies on the ADR mechanism to > flush the write buffers on power failure. Deep flush is there to explicitly > flush those write buffers to protect against (rare) ADR failure. > This change prevents a regression in deep flush behavior so that applications > can continue to depend on fsync() as a mechanism to trigger deep flush in the > filesystem-dax case. That's still very confusing text. Specifically, the part where you say that pmem driver writes always make it to the DIMM. I think the changelog could start with "Deep flush is there to explicitly flush write buffers" Anyway, the fix looks right to me. Reviewed-by: Jeff Moyer
Re: [PATCH 08/36] aio: implement io_pgetevents
Christoph Hellwig writes: > This is the io_getevents equivalent of ppoll/pselect and allows to > properly mix signals and aio completions (especially with IOCB_CMD_POLL) > and atomically executes the following sequence: > > sigset_t origmask; > > pthread_sigmask(SIG_SETMASK, &sigmask, &origmask); > ret = io_getevents(ctx, min_nr, nr, events, timeout); > pthread_sigmask(SIG_SETMASK, &origmask, NULL); > > Note that unlike many other signal related calls we do not pass a sigmask > size, as that would get us to 7 arguments, which aren't easily supported > by the syscall infrastructure. It seems a lot less painful to just add a > new syscall variant in the unlikely case we're going to increase the > sigset size. > > Signed-off-by: Christoph Hellwig Acked-by: Jeff Moyer I assume other arch maintainers will also plumb this in? -Jeff > --- > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > fs/aio.c | 114 > ++--- > include/linux/compat.h | 7 ++ > include/linux/syscalls.h | 6 ++ > include/uapi/asm-generic/unistd.h | 4 +- > include/uapi/linux/aio_abi.h | 6 ++ > kernel/sys_ni.c| 2 + > 8 files changed, 130 insertions(+), 11 deletions(-) > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl > b/arch/x86/entry/syscalls/syscall_32.tbl > index 448ac2161112..5997c3e9ac3e 100644 > --- a/arch/x86/entry/syscalls/syscall_32.tbl > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > @@ -391,3 +391,4 @@ > 382 i386pkey_free sys_pkey_free > 383 i386statx sys_statx > 384 i386arch_prctl sys_arch_prctl > compat_sys_arch_prctl > +385 i386io_pgetevents sys_io_pgetevents > compat_sys_io_pgetevents > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl > b/arch/x86/entry/syscalls/syscall_64.tbl > index 5aef183e2f85..e995cd2b4e65 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -339,6 +339,7 @@ > 330 common pkey_alloc sys_pkey_alloc > 331 common pkey_free sys_pkey_free > 332 common statx sys_statx > +333 common io_pgetevents sys_io_pgetevents > > # > # x32-specific system call numbers start at 512 to avoid cache impact > diff --git a/fs/aio.c b/fs/aio.c > index 9d7d6e4cde87..da87cbf7c67a 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1291,10 +1291,6 @@ static long read_events(struct kioctx *ctx, long > min_nr, long nr, > wait_event_interruptible_hrtimeout(ctx->wait, > aio_read_events(ctx, min_nr, nr, event, &ret), > until); > - > - if (!ret && signal_pending(current)) > - ret = -EINTR; > - > return ret; > } > > @@ -1874,13 +1870,60 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id, > struct timespec __user *, timeout) > { > struct timespec64 ts; > + int ret; > + > + if (timeout && unlikely(get_timespec64(&ts, timeout))) > + return -EFAULT; > + > + ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); > + if (!ret && signal_pending(current)) > + ret = -EINTR; > + return ret; > +} > + > +SYSCALL_DEFINE6(io_pgetevents, > + aio_context_t, ctx_id, > + long, min_nr, > + long, nr, > + struct io_event __user *, events, > + struct timespec __user *, timeout, > + const struct __aio_sigset __user *, usig) > +{ > + struct __aio_sigset ksig = { NULL, }; > + sigset_tksigmask, sigsaved; > + struct timespec64 ts; > + int ret; > + > + if (timeout && unlikely(get_timespec64(&ts, timeout))) > + return -EFAULT; > > - if (timeout) { > - if (unlikely(get_timespec64(&ts, timeout))) > + if (usig && copy_from_user(&ksig, usig, sizeof(ksig))) > + return -EFAULT; > + > + if (ksig.sigmask) { > + if (ksig.sigsetsize != sizeof(sigset_t)) > + return -EINVAL; > + if (copy_from_user(&ksigmask, ksig.sigmask, sizeof(ksigmask))) > return -EFAULT; > + sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP)); > +
Re: [PATCH 14/36] aio: implement IOCB_CMD_POLL
Christoph Hellwig writes: > Simple one-shot poll through the io_submit() interface. To poll for > a file descriptor the application should submit an iocb of type > IOCB_CMD_POLL. It will poll the fd for the events specified in the > the first 32 bits of the aio_buf field of the iocb. > > Unlike poll or epoll without EPOLLONESHOT this interface always works > in one shot mode, that is once the iocb is completed, it will have to be > resubmitted. > > Signed-off-by: Christoph Hellwig Acked-by: Jeff Moyer
Re: aio poll, io_pgetevents and a new in-kernel poll API V3
Christoph Hellwig writes: > On Thu, Jan 18, 2018 at 11:44:03AM -0500, Jeff Moyer wrote: >> Jeff Moyer writes: >> >> > FYI, this kernel has issues. It will boot up, but I don't have >> > networking, and even rebooting doesn't succeed. I'm looking into it. >> >> A bisect lands on: eventfd: switch to ->poll_mask. That's not super >> helpful, though. I did run the ltp eventfd2 tests, and they all pass. >> >> The actual issue I get on boot is that several services don't start: >> >> [...] >> >> Christoph, are you able to reproduce this? > > No, I can't reproduce any of that. But I don't have a Fedora system > either, so this might be a new systemd version doing funky things. > > The major change in this version was to call ->poll_mask before setting up > the wait queue as well. This does the right thing for poll and aio poll, > but the more I dig into the epoll code the less sure I am it does the right > thing for it, or in fact that epoll does the right thing in general.. > > Do you still see it with the patch below applied? No, that fixes it for me. -Jeff > diff --git a/fs/select.c b/fs/select.c > index 707abe79536b..1784c1a29253 100644 > --- a/fs/select.c > +++ b/fs/select.c > @@ -53,9 +53,9 @@ __poll_t vfs_poll(struct file *file, struct > poll_table_struct *pt) > head = vfs_get_poll_head(file, events); > if (!head) > return DEFAULT_POLLMASK; > - mask = file->f_op->poll_mask(file, events); > - if (mask) > - return mask; > +// mask = file->f_op->poll_mask(file, events); > +// if (mask) > +// return mask; > > pt->_qproc(file, head, pt); > }
Re: aio poll, io_pgetevents and a new in-kernel poll API V3
Avi Kivity writes: > On 01/18/2018 05:46 PM, Jeff Moyer wrote: >> FYI, this kernel has issues. It will boot up, but I don't have >> networking, and even rebooting doesn't succeed. I'm looking into it. > > FWIW, I'm running an older version of this patchset on my desktop with > no problems so far. Right, it's worth mentioning that v1 worked fine for me. -Jeff >> Christoph Hellwig writes: >> >>> Hi all, >>> >>> this series adds support for the IOCB_CMD_POLL operation to poll for the >>> readyness of file descriptors using the aio subsystem. The API is based >>> on patches that existed in RHAS2.1 and RHEL3, which means it already is >>> supported by libaio. To implement the poll support efficiently new >>> methods to poll are introduced in struct file_operations: get_poll_head >>> and poll_mask. The first one returns a wait_queue_head to wait on >>> (lifetime is bound by the file), and the second does a non-blocking >>> check for the POLL* events. This allows aio poll to work without >>> any additional context switches, unlike epoll. >>> >>> To make the interface fully useful a new io_pgetevents system call is >>> added, which atomically saves and restores the signal mask over the >>> io_pgetevents system call. It it the logical equivalent to pselect and >>> ppoll for io_pgetevents. >>> >>> The corresponding libaio changes for io_pgetevents support and >>> documentation, as well as a test case will be posted in a separate >>> series. >>> >>> The changes were sponsored by Scylladb, and improve performance >>> of the seastar framework up to 10%, while also removing the need >>> for a privileged SCHED_FIFO epoll listener thread. >>> >>> The patches are on top of Als __poll_t annoations, so I've also >>> prepared a git branch on top of those here: >>> >>> git://git.infradead.org/users/hch/vfs.git aio-poll.3 >>> >>> Gitweb: >>> >>> >>> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.3 >>> >>> Libaio changes: >>> >>> https://pagure.io/libaio.git io-poll >>> >>> Seastar changes (not updated for the new io_pgetevens ABI yet): >>> >>> https://github.com/avikivity/seastar/commits/aio >>> >>> Changes since V2: >>> - removed a double initialization >>> - new vfs_get_poll_head helper >>> - document that ->get_poll_head can return NULL >>> - call ->poll_mask before sleeping >>> - various ACKs >>> - add conversion of random to ->poll_mask >>> - add conversion of af_alg to ->poll_mask >>> - lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL >>> - reshuffled the series so that prep patches and everything not >>> requiring the new in-kernel poll API is in the beginning >>> >>> Changes since V1: >>> - handle the NULL ->poll case in vfs_poll >>> - dropped the file argument to the ->poll_mask socket operation >>> - replace the ->pre_poll socket operation with ->get_poll_head as >>> in the file operations >>> >>> -- >>> To unsubscribe, send a message with 'unsubscribe linux-aio' in >>> the body to majord...@kvack.org. For more info on Linux AIO, >>> see: http://www.kvack.org/aio/ >>> Don't email: mailto:"a...@kvack.org";>a...@kvack.org > > > -- > To unsubscribe, send a message with 'unsubscribe linux-aio' in > the body to majord...@kvack.org. For more info on Linux AIO, > see: http://www.kvack.org/aio/ > Don't email: mailto:"a...@kvack.org";>a...@kvack.org
Re: aio poll, io_pgetevents and a new in-kernel poll API V3
Jeff Moyer writes: > FYI, this kernel has issues. It will boot up, but I don't have > networking, and even rebooting doesn't succeed. I'm looking into it. A bisect lands on: eventfd: switch to ->poll_mask. That's not super helpful, though. I did run the ltp eventfd2 tests, and they all pass. The actual issue I get on boot is that several services don't start: [FAILED] Failed to start Modem Manager. See 'systemctl status ModemManager.service' for details. [FAILED] Failed to start Authorization Manager. See 'systemctl status polkit.service' for details. [DEPEND] Dependency failed for Dynamic System Tuning Daemon. [FAILED] Failed to start Network Manager. See 'systemctl status NetworkManager.service' for details. [DEPEND] Dependency failed for Network Manager Wait Online. ... Christoph, are you able to reproduce this? -Jeff > Christoph Hellwig writes: > >> Hi all, >> >> this series adds support for the IOCB_CMD_POLL operation to poll for the >> readyness of file descriptors using the aio subsystem. The API is based >> on patches that existed in RHAS2.1 and RHEL3, which means it already is >> supported by libaio. To implement the poll support efficiently new >> methods to poll are introduced in struct file_operations: get_poll_head >> and poll_mask. The first one returns a wait_queue_head to wait on >> (lifetime is bound by the file), and the second does a non-blocking >> check for the POLL* events. This allows aio poll to work without >> any additional context switches, unlike epoll. >> >> To make the interface fully useful a new io_pgetevents system call is >> added, which atomically saves and restores the signal mask over the >> io_pgetevents system call. It it the logical equivalent to pselect and >> ppoll for io_pgetevents. >> >> The corresponding libaio changes for io_pgetevents support and >> documentation, as well as a test case will be posted in a separate >> series. >> >> The changes were sponsored by Scylladb, and improve performance >> of the seastar framework up to 10%, while also removing the need >> for a privileged SCHED_FIFO epoll listener thread. >> >> The patches are on top of Als __poll_t annoations, so I've also >> prepared a git branch on top of those here: >> >> git://git.infradead.org/users/hch/vfs.git aio-poll.3 >> >> Gitweb: >> >> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.3 >> >> Libaio changes: >> >> https://pagure.io/libaio.git io-poll >> >> Seastar changes (not updated for the new io_pgetevens ABI yet): >> >> https://github.com/avikivity/seastar/commits/aio >> >> Changes since V2: >> - removed a double initialization >> - new vfs_get_poll_head helper >> - document that ->get_poll_head can return NULL >> - call ->poll_mask before sleeping >> - various ACKs >> - add conversion of random to ->poll_mask >> - add conversion of af_alg to ->poll_mask >> - lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL >> - reshuffled the series so that prep patches and everything not >>requiring the new in-kernel poll API is in the beginning >> >> Changes since V1: >> - handle the NULL ->poll case in vfs_poll >> - dropped the file argument to the ->poll_mask socket operation >> - replace the ->pre_poll socket operation with ->get_poll_head as >>in the file operations >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-aio' in >> the body to majord...@kvack.org. For more info on Linux AIO, >> see: http://www.kvack.org/aio/ >> Don't email: mailto:"a...@kvack.org";>a...@kvack.org > > -- > To unsubscribe, send a message with 'unsubscribe linux-aio' in > the body to majord...@kvack.org. For more info on Linux AIO, > see: http://www.kvack.org/aio/ > Don't email: mailto:"a...@kvack.org";>a...@kvack.org
Re: aio poll, io_pgetevents and a new in-kernel poll API V3
FYI, this kernel has issues. It will boot up, but I don't have networking, and even rebooting doesn't succeed. I'm looking into it. -Jeff Christoph Hellwig writes: > Hi all, > > this series adds support for the IOCB_CMD_POLL operation to poll for the > readyness of file descriptors using the aio subsystem. The API is based > on patches that existed in RHAS2.1 and RHEL3, which means it already is > supported by libaio. To implement the poll support efficiently new > methods to poll are introduced in struct file_operations: get_poll_head > and poll_mask. The first one returns a wait_queue_head to wait on > (lifetime is bound by the file), and the second does a non-blocking > check for the POLL* events. This allows aio poll to work without > any additional context switches, unlike epoll. > > To make the interface fully useful a new io_pgetevents system call is > added, which atomically saves and restores the signal mask over the > io_pgetevents system call. It it the logical equivalent to pselect and > ppoll for io_pgetevents. > > The corresponding libaio changes for io_pgetevents support and > documentation, as well as a test case will be posted in a separate > series. > > The changes were sponsored by Scylladb, and improve performance > of the seastar framework up to 10%, while also removing the need > for a privileged SCHED_FIFO epoll listener thread. > > The patches are on top of Als __poll_t annoations, so I've also > prepared a git branch on top of those here: > > git://git.infradead.org/users/hch/vfs.git aio-poll.3 > > Gitweb: > > http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.3 > > Libaio changes: > > https://pagure.io/libaio.git io-poll > > Seastar changes (not updated for the new io_pgetevens ABI yet): > > https://github.com/avikivity/seastar/commits/aio > > Changes since V2: > - removed a double initialization > - new vfs_get_poll_head helper > - document that ->get_poll_head can return NULL > - call ->poll_mask before sleeping > - various ACKs > - add conversion of random to ->poll_mask > - add conversion of af_alg to ->poll_mask > - lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL > - reshuffled the series so that prep patches and everything not >requiring the new in-kernel poll API is in the beginning > > Changes since V1: > - handle the NULL ->poll case in vfs_poll > - dropped the file argument to the ->poll_mask socket operation > - replace the ->pre_poll socket operation with ->get_poll_head as >in the file operations > > -- > To unsubscribe, send a message with 'unsubscribe linux-aio' in > the body to majord...@kvack.org. For more info on Linux AIO, > see: http://www.kvack.org/aio/ > Don't email: mailto:"a...@kvack.org";>a...@kvack.org
Re: [PATCH 32/32] aio: implement io_pgetevents
Christoph Hellwig writes: > On Tue, Jan 16, 2018 at 07:41:24PM -0500, Jeff Moyer wrote: >> I'd be willing to bet the issue is in your io_syscall6 implementation. >> You pass in arg5 where arg6 should be used. Don't feel bad, it took me >> the better part of today to figure that out. :) >> >> Here's an incremental diff on top of what you've posted. Feel free to >> fold it into your patch (and format however you like). You can find the >> libaio changes in my 'aio-poll' branch: >> https://pagure.io/libaio/commits/aio-poll >> >> These changes were run through the libaio test harness, 64 bit and 32 >> bit, so the compat system call was tested. > > Oops, yes. Although I prefer the copy_from_user version, this is what > I had: Much cleaner. Thanks! Jeff
Re: [PATCH 32/32] aio: implement io_pgetevents
Christoph Hellwig writes: > On Wed, Jan 17, 2018 at 04:27:21AM +, Al Viro wrote: >> On Tue, Jan 16, 2018 at 07:41:24PM -0500, Jeff Moyer wrote: >> >if (sigmask) { >> > - if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask))) >> > + if (!access_ok(VERIFY_READ, sigmask, >> > + sizeof(void *) + sizeof(size_t)) || >> > + __get_user(up, (sigset_t __user * __user *)sigmask) || >> > + __get_user(sigsetsize, >> > + (size_t __user *)(sigmask + sizeof(void * >> >return -EFAULT; >> >> How about copy_from_user() on a struct? Making eyes bleed is fun, but >> people tend to get annoyed when you do it to them... > > Above is the copy & paste version from pselect. I've got both copy_from_user > and that horrible version in my tree, and if we really need this awfull > calling convention copy_from_user certainly is much better. pselect > also should be switched to explicit struct + copy_from_user while > we're at it. In fact glibc defines a struct for the userland version > to start with. Yeah, I did both, but stuck with this variant as that's what pselect did. I'm fine with switching to the struct variant. -Jeff
Re: [PATCH 32/32] aio: implement io_pgetevents
Hi, Christoph, Christoph Hellwig writes: > On Mon, Jan 15, 2018 at 09:53:10AM +0100, Christoph Hellwig wrote: >> > pselect, as an example, crams the sigmask and size together. Why not >> > just do that? libaio can take care of setting that up. >> >> Yes, I could try that. It's just another double indirection for no >> good reason. > > I cna't get this to work - for some reason I always end up with a NULL > sigmask in the kernel. Nevermind that it leads to really crap code > generation. I guess for select the latter doesn't matter too much as > everyone sane uses ppoll anyway. I'd be willing to bet the issue is in your io_syscall6 implementation. You pass in arg5 where arg6 should be used. Don't feel bad, it took me the better part of today to figure that out. :) Here's an incremental diff on top of what you've posted. Feel free to fold it into your patch (and format however you like). You can find the libaio changes in my 'aio-poll' branch: https://pagure.io/libaio/commits/aio-poll These changes were run through the libaio test harness, 64 bit and 32 bit, so the compat system call was tested. Cheers, Jeff diff --git a/fs/aio.c b/fs/aio.c index 57a4e8d..c6d67d0 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1991,9 +1991,11 @@ static long do_io_getevents(aio_context_t ctx_id, long, nr, struct io_event __user *, events, struct timespec __user *, timeout, - const sigset_t __user *, sigmask) + void __user *, sigmask) { sigset_tksigmask, sigsaved; + size_t sigsetsize = 0; + sigset_t __user *up = NULL; struct timespec64 ts; int ret; @@ -2001,8 +2003,18 @@ static long do_io_getevents(aio_context_t ctx_id, return -EFAULT; if (sigmask) { - if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask))) + if (!access_ok(VERIFY_READ, sigmask, + sizeof(void *) + sizeof(size_t)) || + __get_user(up, (sigset_t __user * __user *)sigmask) || + __get_user(sigsetsize, + (size_t __user *)(sigmask + sizeof(void * return -EFAULT; + + if (sigsetsize != sizeof(sigset_t)) + return -EINVAL; + if (copy_from_user(&ksigmask, up, sizeof(ksigmask))) + return -EFAULT; + sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP)); sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); } @@ -2049,8 +2061,11 @@ static long do_io_getevents(aio_context_t ctx_id, compat_long_t, nr, struct io_event __user *, events, struct compat_timespec __user *, timeout, - const compat_sigset_t __user *, sigmask) + void __user *, sig) { + compat_size_t sigsetsize = 0; + compat_sigset_t __user *sigmask; + compat_uptr_t up = 0; sigset_t ksigmask, sigsaved; struct timespec64 t; int ret; @@ -2058,8 +2073,17 @@ static long do_io_getevents(aio_context_t ctx_id, if (timeout && compat_get_timespec64(&t, timeout)) return -EFAULT; - if (sigmask) { - if (get_compat_sigset(&ksigmask, sigmask)) + if (sig) { + if (!access_ok(VERIFY_READ, sig, + sizeof(compat_uptr_t) + sizeof(compat_size_t)) || + __get_user(up, (compat_uptr_t __user *)sig) || + __get_user(sigsetsize, + (compat_size_t __user *)(sig + sizeof(up + return -EFAULT; + + if (sigsetsize != sizeof(compat_sigset_t)) + return -EINVAL; + if (get_compat_sigset(&ksigmask, compat_ptr(up))) return -EFAULT; sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP)); sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); diff --git a/include/linux/compat.h b/include/linux/compat.h index a4cda98..32412f8 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -541,7 +541,7 @@ asmlinkage long compat_sys_io_pgetevents(compat_aio_context_t ctx_id, compat_long_t nr, struct io_event __user *events, struct compat_timespec __user *timeout, - const compat_sigset_t __user *sigmask); + void __user *sigmask); asmlinkage long compat_sys_io_submit(compat_aio_context_t ctx_id, int nr, u32 __user *iocb); asmlinkage long compat_sys_mount(const char __user *dev_name, diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 3bc9a13..bc79026 100644 --
Re: [PATCH 32/32] aio: implement io_pgetevents
Christoph Hellwig writes: > This is the io_getevents equivalent of ppoll/pselect and allows to > properly mix signals and aio completions (especially with IOCB_CMD_POLL) > and atomically executes the following sequence: > > sigset_t origmask; > > pthread_sigmask(SIG_SETMASK, &sigmask, &origmask); > ret = io_getevents(ctx, min_nr, nr, events, timeout); > pthread_sigmask(SIG_SETMASK, &origmask, NULL); > > Note that unlike many other signal related calls we do not pass a sigmask > size, as that would get us to 7 arguments, which aren't easily supported > by the syscall infrastructure. It seems a lot less painful to just add a > new syscall variant in the unlikely case we're going to increase the > sigset size. pselect, as an example, crams the sigmask and size together. Why not just do that? libaio can take care of setting that up. Cheers, Jeff > > Signed-off-by: Christoph Hellwig > --- > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > fs/aio.c | 96 > ++ > include/linux/compat.h | 6 +++ > include/linux/syscalls.h | 6 +++ > include/uapi/asm-generic/unistd.h | 4 +- > kernel/sys_ni.c| 2 + > 7 files changed, 105 insertions(+), 11 deletions(-) > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl > b/arch/x86/entry/syscalls/syscall_32.tbl > index 448ac2161112..5997c3e9ac3e 100644 > --- a/arch/x86/entry/syscalls/syscall_32.tbl > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > @@ -391,3 +391,4 @@ > 382 i386pkey_free sys_pkey_free > 383 i386statx sys_statx > 384 i386arch_prctl sys_arch_prctl > compat_sys_arch_prctl > +385 i386io_pgetevents sys_io_pgetevents > compat_sys_io_pgetevents > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl > b/arch/x86/entry/syscalls/syscall_64.tbl > index 5aef183e2f85..e995cd2b4e65 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -339,6 +339,7 @@ > 330 common pkey_alloc sys_pkey_alloc > 331 common pkey_free sys_pkey_free > 332 common statx sys_statx > +333 common io_pgetevents sys_io_pgetevents > > # > # x32-specific system call numbers start at 512 to avoid cache impact > diff --git a/fs/aio.c b/fs/aio.c > index cae90ac6e4a3..57a4e8d89f78 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1299,10 +1299,6 @@ static long read_events(struct kioctx *ctx, long > min_nr, long nr, > wait_event_interruptible_hrtimeout(ctx->wait, > aio_read_events(ctx, min_nr, nr, event, &ret), > until); > - > - if (!ret && signal_pending(current)) > - ret = -EINTR; > - > return ret; > } > > @@ -1978,13 +1974,54 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id, > struct timespec __user *, timeout) > { > struct timespec64 ts; > + int ret; > + > + if (timeout && unlikely(get_timespec64(&ts, timeout))) > + return -EFAULT; > > - if (timeout) { > - if (unlikely(get_timespec64(&ts, timeout))) > + ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); > + if (!ret && signal_pending(current)) > + ret = -EINTR; > + return ret; > +} > + > +SYSCALL_DEFINE6(io_pgetevents, > + aio_context_t, ctx_id, > + long, min_nr, > + long, nr, > + struct io_event __user *, events, > + struct timespec __user *, timeout, > + const sigset_t __user *, sigmask) > +{ > + sigset_tksigmask, sigsaved; > + struct timespec64 ts; > + int ret; > + > + if (timeout && unlikely(get_timespec64(&ts, timeout))) > + return -EFAULT; > + > + if (sigmask) { > + if (copy_from_user(&ksigmask, sigmask, sizeof(ksigmask))) > return -EFAULT; > + sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP)); > + sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved); > } > > - return do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : > NULL); > + ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL); > + if (signal_pending(current)) { > + if (sigmask) { > + current->saved_sigmask = sigsaved; > + set_restore_sigmask(); > + } > + > + if (!ret) > + ret = -ERESTARTNOHAND; > + } else { > + if (sigmask) > + sigprocmask(SIG_SETMASK, &sigsaved, NULL); > + } > + > + return ret; > } > > #ifdef CONFIG_COMPAT > @@ -1995,13 +2032,52 @@ COMPAT_S
Re: [PATCH 1/1] aio: make sure the input "timeout" value is valid
Matthew Wilcox writes: > On Thu, Dec 14, 2017 at 11:18:30AM +0800, Leizhen (ThunderTown) wrote: >> On 2017/12/14 3:31, Matthew Wilcox wrote: >> > On Wed, Dec 13, 2017 at 11:27:00AM -0500, Jeff Moyer wrote: >> >> Matthew Wilcox writes: >> >> >> >>> On Wed, Dec 13, 2017 at 09:42:52PM +0800, Zhen Lei wrote: >> >>>> Below information is reported by a lower kernel version, and I saw the >> >>>> problem still exist in current version. >> >>> >> >>> I think you're right, but what an awful interface we have here! >> >>> The user must not only fetch it, they must validate it separately? >> >>> And if they forget, then userspace is provoking undefined behaviour? >> >>> Ugh. >> >>> Why not this: >> >> >> >> Why not go a step further and have get_timespec64 check for validity? >> >> I wonder what caller doesn't want that to happen... >> I tried this before. But I found some places call get_timespec64 in the >> following function. >> If we do the check in get_timespec64, the check will be duplicated. >> >> For example: >> static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp, >> >> if (get_timespec64(&ts, tsp)) >> return -EFAULT; >> >> to = &end_time; >> if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec)) >> >> int poll_select_set_timeout(struct timespec64 *to, time64_t sec, long nsec) >> { >> struct timespec64 ts = {.tv_sec = sec, .tv_nsec = nsec}; >> >> if (!timespec64_valid(&ts)) >> return -EINVAL; > > The check is only two comparisons! Why do we have an interface that can > cause bugs for the sake of saving *two comparisons*?! Can we talk about > the cost of a cache miss versus the cost of executing these comparisons? Any update on this? Willy, I'd be okay with your get_valid_timespec64 patch if you wanted to formally submit that. -Jeff
Re: [PATCH 30/32] aio: add delayed cancel support
Christoph Hellwig writes: > On Wed, Jan 10, 2018 at 06:26:39PM -0500, Jeff Moyer wrote: >> >> The upcoming aio poll support would like to be able to complete the >> >> iocb inline from the cancellation context, but that would cause >> >> a lock order reversal. Add support for optionally moving the cancelation >> >> outside the context lock to avoid this reversal. >> >> >> >> Signed-off-by: Christoph Hellwig >> > >> > Acked-by: Jeff Moyer >> >> Actually, let's move these two defines: >> >> #define AIO_IOCB_DELAYED_CANCEL (1 << 0) >> #define AIO_IOCB_CANCELLED (1 << 1) >> >> to include/linux/aio.h so that drivers outside of fs/aio.c can make use >> of them. > > struct aio_kiocb is private to aio.c, so just exposing them won't > do anything useful. If we really need these elsewhere we'll need > to come up with a proper interface. Duh, good point. My main concern is that things like usb gadget will have to deal with races between cancellation and completion on their own. It would be nice if we had infrastructure for them to use. I'll have a look through that code to see if there's something we could or should be doing. Cheers, Jeff
Re: aio poll, io_pgetevents and a new in-kernel poll API V2
"Michael Kerrisk (man-pages)" writes: Hi, Michael, > Are there some man pages patches already for these changes? https://patchwork.kernel.org/patch/10144129/ Cheers, Jeff
Re: [PATCH 30/32] aio: add delayed cancel support
Jeff Moyer writes: > Christoph Hellwig writes: > >> The upcoming aio poll support would like to be able to complete the >> iocb inline from the cancellation context, but that would cause >> a lock order reversal. Add support for optionally moving the cancelation >> outside the context lock to avoid this reversal. >> >> Signed-off-by: Christoph Hellwig > > Acked-by: Jeff Moyer Actually, let's move these two defines: #define AIO_IOCB_DELAYED_CANCEL (1 << 0) #define AIO_IOCB_CANCELLED (1 << 1) to include/linux/aio.h so that drivers outside of fs/aio.c can make use of them. -Jeff
Re: [PATCH 30/32] aio: add delayed cancel support
Christoph Hellwig writes: > The upcoming aio poll support would like to be able to complete the > iocb inline from the cancellation context, but that would cause > a lock order reversal. Add support for optionally moving the cancelation > outside the context lock to avoid this reversal. > > Signed-off-by: Christoph Hellwig Acked-by: Jeff Moyer
Re: [PATCH 29/32] aio: delete iocbs from the active_reqs list in kiocb_cancel
Christoph Hellwig writes: > One we cancel an iocb there is no reason to keep it on the active_reqs > list, given that the list is only used to look for cancelation candidates. > > Signed-off-by: Christoph Hellwig Acked-by: Jeff Moyer
Re: [PATCH 28/32] aio: simplify cancellation
Christoph Hellwig writes: > With the current aio code there is no need for the magic KIOCB_CANCELLED > value, as a cancelation just kicks the driver to queue the completion > ASAP, with all actual completion handling done in another thread. Given > that both the completion path and cancelation take the context lock there > is no need for magic cmpxchg loops either. The cmpxchg was made unnecessary when the complementary xchg was removed in commit bec68faaf ("aio: io_cancel() no longer returns the io_event"). There was no longer a need to ensure only the completion path or the cancellation path delivered the completion event. Now, the completion is always delivered via aio_complete. So yeah, we can get rid of this. Acked-by: Jeff Moyer
Re: [PATCH 27/32] aio: sanitize ki_list handling
Christoph Hellwig writes: > Instead of handcoded non-null checks always initialize ki_list to an > empty list and use list_empty / list_empty_careful on it. Yeah, who knows why list_empty wasn't used from the beginning. In the past, tricks were played by overwriting list pointers with non-null, but that wasn't the case here. > While we're at it also error out on a double call to > kiocb_set_cancel_fn instead of ignoring it. > > Signed-off-by: Christoph Hellwig > --- > fs/aio.c | 13 ++--- > 1 file changed, 6 insertions(+), 7 deletions(-) > > diff --git a/fs/aio.c b/fs/aio.c > index 496b50f9e9b1..fe241b5b44b2 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -555,13 +555,12 @@ void kiocb_set_cancel_fn(struct kiocb *iocb, > kiocb_cancel_fn *cancel) > struct kioctx *ctx = req->ki_ctx; > unsigned long flags; > > - spin_lock_irqsave(&ctx->ctx_lock, flags); > - > - if (!req->ki_list.next) > - list_add(&req->ki_list, &ctx->active_reqs); > + if (WARN_ON_ONCE(!list_empty(&req->ki_list))) > + return; > > + spin_lock_irqsave(&ctx->ctx_lock, flags); > + list_add_tail(&req->ki_list, &ctx->active_reqs); > req->ki_cancel = cancel; So, this changes behavior from quietly overwriting the prior cancel function to not doing it. I don't think it matters one bit, though. Callers shouldn't do that. Acked-by: Jeff Moyer
Re: [PATCH 26/32] aio: refactor read/write iocb setup
Christoph Hellwig writes: > Don't reference the kiocb structure from the common aio code, and move > any use of it into helper specific to the read/write path. This is in > preparation for aio_poll support that wants to use the space for different > fields. > > Signed-off-by: Christoph Hellwig I think this all looks okay (making sure everything was still cleaned up in the error cases was a bit of a chore). One nit below. > +static int aio_prep_rw(struct kiocb *req, struct iocb *iocb) > +{ > + int ret; > + > + req->ki_filp = fget(iocb->aio_fildes); > + if (unlikely(!req->ki_filp)) > + return -EBADF; > + req->ki_complete = aio_complete_rw; > + req->ki_flags = 0; The above assignment seems superfluous... > + req->ki_pos = iocb->aio_offset; > + req->ki_flags = iocb_flags(req->ki_filp); because of this. Acked-by: Jeff Moyer
Re: [PATCH 24/31] aio: remove an outdated comment in aio_complete
Christoph Hellwig writes: > These days we don't treat sync iocbs special in the aio completion code as > they never use it. Remove the old comment, and move the BUG_ON for a sync > iocb to the top of the function. > > Signed-off-by: Christoph Hellwig Right, this should have been part of commit 599bd19bdc4c6 ("fs: don't allow to complete sync iocbs through aio_complete"). Reviewed-by: Jeff Moyer > --- > fs/aio.c | 11 ++- > 1 file changed, 2 insertions(+), 9 deletions(-) > > diff --git a/fs/aio.c b/fs/aio.c > index 03d59593912d..41fc8ce6bc7f 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1088,6 +1088,8 @@ static void aio_complete(struct kiocb *kiocb, long res, > long res2) > unsigned tail, pos, head; > unsigned long flags; > > + BUG_ON(is_sync_kiocb(kiocb)); > + > if (kiocb->ki_flags & IOCB_WRITE) { > struct file *file = kiocb->ki_filp; > > @@ -1100,15 +1102,6 @@ static void aio_complete(struct kiocb *kiocb, long > res, long res2) > file_end_write(file); > } > > - /* > - * Special case handling for sync iocbs: > - * - events go directly into the iocb for fast handling > - * - the sync task with the iocb in its stack holds the single iocb > - *ref, no other paths have a way to get another ref > - * - the sync task helpfully left a reference to itself in the iocb > - */ > - BUG_ON(is_sync_kiocb(kiocb)); > - > if (iocb->ki_list.next) { > unsigned long flags;
Re: [PATCH 23/31] aio: don't print the page size at boot time
Christoph Hellwig writes: > The page size is in no way related to the aio code, and printing it in > the (debug) dmesg at every boot serves no purpose. > > Signed-off-by: Christoph Hellwig Acked-by: Jeff Moyer
Re: [PATCH 6/6] add test for aio poll and io_pgetevents
Christoph Hellwig writes: > + p = fork(); > + switch (p) { [snip] > + default: > + close(pipe1[0]); > + close(pipe2[1]); > + > + io_prep_poll(&iocb, pipe2[0], POLLIN); > + > + ret = io_setup(1, &ctx); > + if (ret) { > + printf("child: io_setup failed\n"); parent > + return 1; > + } > + > + ret = io_submit(ctx, 1, iocbs); > + if (ret != 1) { > + printf("child: io_submit failed\n"); parent Other than that, looks ok to me. Thanks for writing a test! I can fix this up, no need to repost. -Jeff
Re: libaio: resurrect aio poll and add io_pgetevents support
Christoph Hellwig writes: > Hi all, > > this series resurrects IOCB_CMD_POLL support and adds support for the > new io_pgetevents system call, as well as adding a test case. This looks good to me. There may be a couple of changes to the syscall bits, but I can take care of that. I'll review the kernel bits more thoroughly next week. -Jeff
Re: [PATCH 2/6] move _body_io_syscall to the generic syscall.h
Hi, Ben, Thanks for the quick reply. Benjamin LaHaise writes: > On Fri, Jan 05, 2018 at 11:25:17AM -0500, Jeff Moyer wrote: >> Christoph Hellwig writes: >> >> > This way it can be used for the fallback 6-argument version on >> > all architectures. >> > >> > Signed-off-by: Christoph Hellwig >> >> This is a strange way to do things. However, I was never really sold on >> libaio having to implement its own system call wrappers. That decision >> definitely resulted in some maintenance overhead. >> >> Ben, what was your reasoning for not just using syscall? > > The main issue was that glibc's pthreads implementation really sucked back > during initial development and there was a use-case for having the io_XXX > functions usable directly from clone()ed threads that didn't have all the > glibc pthread state setup for per-cpu areas to handle per-thread errno. > That made sense back then, but is rather silly today. Thanks for the background info. > Technically, I'm not sure the generic syscall wrapper is safe to use. The > io_XXX arch wrappers don't modify errno, while it appears the generic one > does. That said, nobody has ever noticed... Good point. Common architectures don't use the generic syscall wrapper, so I'm not sure we can conclude that it won't break anything. At the same time, I'm not sure I want to write and test the io_syscall6 assembly for all of the supported arches. I could save and restore errno. That sounds ugly, but less painful than the other options. Does anyone have any strong preferences? -Jeff
Re: [PATCH 2/6] move _body_io_syscall to the generic syscall.h
Christoph Hellwig writes: > This way it can be used for the fallback 6-argument version on > all architectures. > > Signed-off-by: Christoph Hellwig This is a strange way to do things. However, I was never really sold on libaio having to implement its own system call wrappers. That decision definitely resulted in some maintenance overhead. Ben, what was your reasoning for not just using syscall? -Jeff > --- > src/syscall-generic.h | 6 -- > src/syscall.h | 7 +++ > 2 files changed, 7 insertions(+), 6 deletions(-) > > diff --git a/src/syscall-generic.h b/src/syscall-generic.h > index 24d7c7c..35b8580 100644 > --- a/src/syscall-generic.h > +++ b/src/syscall-generic.h > @@ -2,12 +2,6 @@ > #include > #include > > -#define _body_io_syscall(sname, args...) \ > -{ \ > - int ret = syscall(__NR_##sname, ## args); \ > - return ret < 0 ? -errno : ret; \ > -} > - > #define io_syscall1(type,fname,sname,type1,arg1) \ > type fname(type1 arg1) \ > _body_io_syscall(sname, (long)arg1) > diff --git a/src/syscall.h b/src/syscall.h > index a2da030..3819519 100644 > --- a/src/syscall.h > +++ b/src/syscall.h > @@ -10,6 +10,13 @@ > #define DEFSYMVER(compat_sym, orig_sym, ver_sym) \ > __asm__(".symver " SYMSTR(compat_sym) "," SYMSTR(orig_sym) "@@LIBAIO_" > SYMSTR(ver_sym)); > > +/* generic fallback */ > +#define _body_io_syscall(sname, args...) \ > +{ \ > + int ret = syscall(__NR_##sname, ## args); \ > + return ret < 0 ? -errno : ret; \ > +} > + > #if defined(__i386__) > #include "syscall-i386.h" > #elif defined(__x86_64__)
Re: [PATCH 1/1] aio: make sure the input "timeout" value is valid
Matthew Wilcox writes: > On Wed, Dec 13, 2017 at 09:42:52PM +0800, Zhen Lei wrote: >> Below information is reported by a lower kernel version, and I saw the >> problem still exist in current version. > > I think you're right, but what an awful interface we have here! > The user must not only fetch it, they must validate it separately? > And if they forget, then userspace is provoking undefined behaviour? Ugh. > Why not this: Why not go a step further and have get_timespec64 check for validity? I wonder what caller doesn't want that to happen... -Jeff > > diff --git a/fs/aio.c b/fs/aio.c > index 4adbdcbe753a..fdd16cf897c8 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1788,8 +1788,9 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id, > struct timespec64 ts; > > if (timeout) { > - if (unlikely(get_timespec64(&ts, timeout))) > - return -EFAULT; > + int error = get_valid_timespec64(&ts, timeout); > + if (error) > + return error; > } > > return do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : > NULL); > @@ -1805,9 +1806,9 @@ COMPAT_SYSCALL_DEFINE5(io_getevents, > compat_aio_context_t, ctx_id, > struct timespec64 t; > > if (timeout) { > - if (compat_get_timespec64(&t, timeout)) > - return -EFAULT; > - > + int error = compat_get_valid_timespec64(&t, timeout); > + if (error) > + return error; > } > > return do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL); > diff --git a/include/linux/compat.h b/include/linux/compat.h > index 0fc36406f32c..578fc0f208d9 100644 > --- a/include/linux/compat.h > +++ b/include/linux/compat.h > @@ -172,6 +172,26 @@ extern int get_compat_itimerspec64(struct itimerspec64 > *its, > extern int put_compat_itimerspec64(const struct itimerspec64 *its, > struct compat_itimerspec __user *uits); > > +static inline __must_check > +int compat_get_valid_timespec64(struct timespec64 *ts, const void __user > *uts) > +{ > + if (unlikely(compat_get_timespec64(ts, uts))) > + return -EFAULT; > + if (unlikely(!timespec64_valid(ts))) > + return -EINVAL; > + return 0; > +} > + > +static inline __must_check > +int compat_get_strict_timespec64(struct timespec64 *ts, const void __user > *uts) > +{ > + if (unlikely(compat_get_timespec64(ts, uts))) > + return -EFAULT; > + if (unlikely(!timespec64_valid_strict(ts))) > + return -EINVAL; > + return 0; > +} > + > struct compat_iovec { > compat_uptr_t iov_base; > compat_size_t iov_len; > diff --git a/include/linux/time.h b/include/linux/time.h > index 4b62a2c0a661..506d87483d04 100644 > --- a/include/linux/time.h > +++ b/include/linux/time.h > @@ -18,6 +18,26 @@ int get_itimerspec64(struct itimerspec64 *it, > int put_itimerspec64(const struct itimerspec64 *it, > struct itimerspec __user *uit); > > +static inline __must_check int get_valid_timespec64(struct timespec64 *ts, > + const struct timespec __user *uts) > +{ > + if (unlikely(get_timespec64(ts, uts))) > + return -EFAULT; > + if (unlikely(!timespec64_valid(ts))) > + return -EINVAL; > + return 0; > +} > + > +static inline __must_check int get_strict_timespec64(struct timespec64 *ts, > + const struct timespec __user *uts) > +{ > + if (unlikely(get_timespec64(ts, uts))) > + return -EFAULT; > + if (unlikely(!timespec64_valid_strict(ts))) > + return -EINVAL; > + return 0; > +} > + > extern time64_t mktime64(const unsigned int year, const unsigned int mon, > const unsigned int day, const unsigned int hour, > const unsigned int min, const unsigned int sec); > > -- > To unsubscribe, send a message with 'unsubscribe linux-aio' in > the body to majord...@kvack.org. For more info on Linux AIO, > see: http://www.kvack.org/aio/ > Don't email: mailto:"a...@kvack.org";>a...@kvack.org
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
Kirill Tkhai writes: >> I think you just need to account the completion ring. > > A request of struct aio_kiocb type consumes much more memory, than > struct io_event does. Shouldn't we account it too? Not in my opinion. The completion ring is the part that gets pinned for long periods of time. Just be sure to document this where appropriate. Users/admins should know that the aio completion ring now contributes to their memory budget. Cheers, Jeff
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
Kirill Tkhai writes: > On 05.12.2017 00:52, Tejun Heo wrote: >> Hello, Kirill. >> >> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote: Can you please explain how this is a fundamental resource which can't be controlled otherwise? >>> >>> Currently, aio_nr and aio_max_nr are global. In case of containers this >>> means that a single container may occupy all aio requests, which are >>> available in the system, and to deprive others possibility to use aio >>> at all. This may happen because of evil intentions of the container's >>> user or because of the program error, when the user makes this occasionally. >> >> Hmm... I see. It feels really wrong to me to make this a first class >> resource because there is a system wide limit. The only reason I can >> think of for the system wide limit is to prevent too much kernel >> memory consumed by creating a lot of aios but that squarely falls >> inside cgroup memory controller protection. If there are other >> reasons why the number of aios should be limited system-wide, please >> bring them up. >> >> If the only reason is kernel memory consumption protection, the only >> thing we need to do is making sure that memory used for aio commands >> are accounted against cgroup kernel memory consumption and >> relaxing/removing system wide limit. > > So, we just use GFP_KERNEL_ACCOUNT flag for allocation of internal aio > structures and pages, and all the memory will be accounted in kmem and > limited by memcg. Looks very good. > > One detail about memory consumption. io_submit() calls primitives > file_operations::write_iter and read_iter. It's not clear for me whether > they consume the same memory as if writev() or readv() system calls > would be used instead. writev() may delay the actual write till dirty > pages limit will be reached, so it seems logic of the accounting should > be the same. So aio mustn't use more not accounted system memory in file > system internals, then simple writev(). > > Could you please to say if you have thoughts about this? I think you just need to account the completion ring. Cheers, Jeff
Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup
Kirill Tkhai writes: > Hi, Benjamin, > > On 04.12.2017 19:52, Benjamin LaHaise wrote: >> Hi Kirill, >> >> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote: >>> Hi, >>> >>> this patch set introduces accounting aio_nr and aio_max_nr per blkio cgroup. >>> It may be used to limit number of aio requests, which are available for >>> a cgroup, and could be useful for containers. >>> >>> The accounting is hierarchical, and aio contexts, allocated in child cgroup, >>> are accounted in parent cgroups too. >>> >>> Also, new cgroup file "io.aio_nr" is introduced. It's used to set cgroup >>> aio requests limit, to show current limit and number of currenly occupied >>> requests. >> >> Where are your test cases to check this functionality in the libaio test >> suite? > > I tried to find actual libaio test suite repository url in google, > but there is no certain answer. Also, there is no information in kernel > anywhere (I hope I grepped right). > > Could you please provide url where actual upstream libaio could be obtained? https://pagure.io/libaio Patches can be sent to this list (linux-aio). Cheers, Jeff >>> Patches 1-3 are refactoring. >>> Patch 4 is the place where the accounting actually introduced. >>> Patch 5 adds "io.aio_nr" file. >>> >>> --- >>> >>> Kirill Tkhai (5): >>> aio: Move aio_nr increment to separate function >>> aio: Export aio_nr_lock and aio_max_nr initial value to >>> include/linux/aio.h >>> blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr >>> blkcg: Charge aio requests in blkio cgroup hierarchy >>> blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr >>> >>> >>> block/blk-cgroup.c | 88 +- >>> fs/aio.c | 151 >>> >>> include/linux/aio.h| 21 ++ >>> include/linux/blk-cgroup.h |4 + >>> 4 files changed, 247 insertions(+), 17 deletions(-) >>> >>> -- >>> Signed-off-by: Kirill Tkhai >>> >> > > -- > To unsubscribe, send a message with 'unsubscribe linux-aio' in > the body to majord...@kvack.org. For more info on Linux AIO, > see: http://www.kvack.org/aio/ > Don't email: mailto:"a...@kvack.org";>a...@kvack.org
[patch] tools/testing/radix-tree: ida_get_new_above returns EAGAIN
...not ENOMEM. Fix it. Signed-off-by: Jeff Moyer diff --git a/tools/testing/radix-tree/idr-test.c b/tools/testing/radix-tree/idr-test.c index 30cd0b2..2056d83 100644 --- a/tools/testing/radix-tree/idr-test.c +++ b/tools/testing/radix-tree/idr-test.c @@ -380,7 +380,7 @@ void ida_check_random(void) do { ida_pre_get(&ida, GFP_KERNEL); err = ida_get_new_above(&ida, bit, &id); - } while (err == -ENOMEM); + } while (err == -EAGAIN); assert(!err); assert(id == bit); }
Re: [lkp-robot] [fs] 332391a993: WARNING:at_fs/iomap.c:#iomap_dio_complete
kernel test robot writes: > FYI, we noticed the following commit (built with gcc-6): > > commit: 332391a9935da939319e473b4680e173df75afcf ("fs: Fix page cache > inconsistency when mixing buffered and AIO DIO") > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master [...] > [ 170.063040] WARNING: CPU: 0 PID: 33 at fs/iomap.c:731 > iomap_dio_complete+0x129/0x130 > [ 170.068758] Modules linked in: xfs dm_mod rpcsec_gss_krb5 > auth_rpcgss nfsv4 dns_resolver sr_mod cdrom sg ata_generic pata_acpi > ppdev ata_piix parport_pc snd_pcm snd_timer snd soundcore serio_raw > pcspkr i2c_piix4 libata floppy parport ip_tables > [ 170.074819] CPU: 0 PID: 33 Comm: kworker/0:1 Not tainted > 4.14.0-rc1-00056-g332391a #1 > [ 170.077915] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > 1.10.2-1 04/01/2014 > [ 170.081047] Workqueue: dio/vda iomap_dio_complete_work > [ 170.081974] task: 88012e504a80 task.stack: c9758000 > [ 170.084465] RIP: 0010:iomap_dio_complete+0x129/0x130 > [ 170.085381] RSP: 0018:c975be18 EFLAGS: 00010286 > [ 170.087536] RAX: fff0 RBX: 0001 RCX: > 000c > [ 170.088853] RDX: RSI: c975bc50 RDI: > 880075715da9 > [ 170.092113] RBP: c975be30 R08: c975bcf0 R09: > 880075715da8 > [ 170.093254] R10: 000c R11: R12: > 88007afa9300 > [ 170.096475] R13: 88007ac7ad80 R14: 88007afa9338 R15: > 88007afa9330 > [ 170.098187] FS: () GS:88013fc0() > knlGS: > [ 170.101322] CS: 0010 DS: ES: CR0: 80050033 > [ 170.102265] CR2: 7f3ad48a8020 CR3: 7fe16000 CR4: > 06f0 > [ 170.104631] Call Trace: > [ 170.105146] iomap_dio_complete_work+0x25/0x80 > [ 170.106704] process_one_work+0x198/0x3e0 > [ 170.107874] worker_thread+0x4e/0x3e0 > [ 170.108522] kthread+0x114/0x150 > [ 170.110683] ? process_one_work+0x3e0/0x3e0 > [ 170.111362] ? kthread_create_on_node+0x40/0x40 > [ 170.113111] ret_from_fork+0x25/0x30 > [ 170.113752] Code: c9 74 17 49 63 74 24 28 41 8b 54 24 24 85 f6 0f 84 > 2a ff ff ff e9 3e ff ff ff 41 8b 74 24 28 49 8b 45 00 48 63 de e9 39 > ff ff ff <0f> ff eb ca 0f 1f 00 0f 1f 44 00 00 55 48 83 ef 30 48 89 e5 > 41 The test case is doing buffered and direct writes to the same file from different processes, and is intended to trigger the case where invalidate_inode_pages2_range fails. I guess it's easier to hit now since we defer completion to a worker thread? Anyway, we should update xfstests to ignore this warning. Cheers, Jeff [...] > generic/027 > generic/070 > generic/117 > generic/208 > 2017-10-09 15:50:42 ./check generic/027 generic/070 generic/117 generic/208 > FSTYP -- xfs (non-debug) > PLATFORM -- Linux/x86_64 vm-kbuild-4G-1 4.14.0-rc1-00056-g332391a > MKFS_OPTIONS -- -f -bsize=4096 /dev/vdd > MOUNT_OPTIONS -- /dev/vdd /fs/scratch > > generic/02754s > generic/07034s > generic/11758s > generic/208200s > _check_dmesg: something found in dmesg (see > /lkp/benchmarks/xfstests/results//generic/208.dmesg) > Ran: generic/027 generic/070 generic/117 generic/208 > Failures: generic/208 > Failed 1 of 4 tests
Re: [PATCH] direct-io: Minor cleanups in do_blockdev_direct_IO
Nikolay Borisov writes: > We already get the block counts and the calculate the end block at the > beginning of the function. Let's use the local variables for consistency and > readability. No functional changes > > Signed-off-by: Nikolay Borisov Looks ok to me. Reviewed-by: Jeff Moyer