Re: [PATCH] drivers: nvdimm: fix memleak

2023-08-17 Thread Jeff Moyer
Konstantin Meskhidze  writes:

> Memory pointed to by 'nd_pmu->pmu.attr_groups' is allocated in
> 'register_nvdimm_pmu' and is leaked by the 'kfree(nd_pmu)' call in
> 'unregister_nvdimm_pmu'.
>
> Co-developed-by: Ivanov Mikhail 
> Signed-off-by: Konstantin Meskhidze 
> ---
>  drivers/nvdimm/nd_perf.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/nvdimm/nd_perf.c b/drivers/nvdimm/nd_perf.c
> index 433bbb68a..14881c4e0 100644
> --- a/drivers/nvdimm/nd_perf.c
> +++ b/drivers/nvdimm/nd_perf.c
> @@ -323,7 +323,8 @@ EXPORT_SYMBOL_GPL(register_nvdimm_pmu);
>  void unregister_nvdimm_pmu(struct nvdimm_pmu *nd_pmu)
>  {
>   perf_pmu_unregister(&nd_pmu->pmu);
>   nvdimm_pmu_free_hotplug_memory(nd_pmu);
> + kfree(nd_pmu->pmu.attr_groups);
>   kfree(nd_pmu);
>  }
>  EXPORT_SYMBOL_GPL(unregister_nvdimm_pmu);

Reviewed-by: Jeff Moyer 




Re: [PATCH] drivers: nvdimm: fix dereference after free

2023-08-17 Thread Jeff Moyer
Konstantin Meskhidze  writes:

> 'nd_pmu->pmu.attr_groups' is dereferenced by the
> 'nvdimm_pmu_free_hotplug_memory' call after it has been freed. Because
> 'nvdimm_pmu_free_hotplug_memory' deallocates memory pointed to by the
> fields of 'nd_pmu->pmu.attr_groups', 'kfree' must be called after
> 'nvdimm_pmu_free_hotplug_memory'.
>
> Co-developed-by: Ivanov Mikhail 
> Signed-off-by: Konstantin Meskhidze 
> ---
>  drivers/nvdimm/nd_perf.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/nvdimm/nd_perf.c b/drivers/nvdimm/nd_perf.c
> index 14881c4e0..2b6dc80d8 100644
> --- a/drivers/nvdimm/nd_perf.c
> +++ b/drivers/nvdimm/nd_perf.c
> @@ -307,10 +307,10 @@ int register_nvdimm_pmu(struct nvdimm_pmu *nd_pmu, struct platform_device *pdev)
>   }
>  
>   rc = perf_pmu_register(&nd_pmu->pmu, nd_pmu->pmu.name, -1);
>   if (rc) {
> - kfree(nd_pmu->pmu.attr_groups);
>   nvdimm_pmu_free_hotplug_memory(nd_pmu);
> + kfree(nd_pmu->pmu.attr_groups);
>   return rc;
>   }
>  
>   pr_info("%s NVDIMM performance monitor support registered\n",

Reviewed-by: Jeff Moyer 




Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem

2023-07-14 Thread Jeff Moyer
David Hildenbrand  writes:

> On 13.07.23 21:12, Jeff Moyer wrote:
>> David Hildenbrand  writes:
>>
>>> On 16.06.23 00:00, Vishal Verma wrote:
>>>> The dax/kmem driver can potentially hot-add large amounts of memory
>>>> originating from CXL memory expanders, or NVDIMMs, or other 'device
>>>> memories'. There is a chance there isn't enough regular system memory
>>>> available to fit the memmap for this new memory. It's therefore
>>>> desirable, if all other conditions are met, for the kmem managed memory
>>>> to place its memmap on the newly added memory itself.
>>>>
>>>> Arrange for this by first allowing for a module parameter override for
>>>> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
>>>> only other caller of this interface in drivers/acpi/acpi_memoryhotplug.c,
>>>> exporting the symbol so it can be called by kmem.c, and finally changing
>>>> the kmem driver to add_memory() in chunks of memory_block_size_bytes().
>>>
>>> 1) Why is the override a requirement here? Just let the admin
>>> configure it and then add conditional support for kmem.
>>>
>>> 2) I recall that there are cases where we don't want the memmap to
>>> land on slow memory (which online_movable would achieve). Just imagine
>>> the slow PMEM case. So this might need another configuration knob on
>>> the kmem side.
>>
>>  From my memory, the case where you don't want the memmap to land on
>> *persistent memory* is when the device is small (such as NVDIMM-N), and
>> you want to reserve as much space as possible for the application data.
>> This has nothing to do with the speed of access.
>
> Now that you mention it, I also do remember the origin of the altmap --
> to achieve exactly that: place the memmap on the device.
>
> commit 4b94ffdc4163bae1ec73b6e977ffb7a7da3d06d3
> Author: Dan Williams 
> Date:   Fri Jan 15 16:56:22 2016 -0800
>
> x86, mm: introduce vmem_altmap to augment vmemmap_populate()
>   In support of providing struct page for large persistent memory
> capacities, use struct vmem_altmap to change the default policy for
> allocating memory for the memmap array.  The default vmemmap_populate()
> allocates page table storage area from the page allocator.  Given
> persistent memory capacities relative to DRAM it may not be feasible to
> store the memmap in 'System Memory'.  Instead vmem_altmap represents
> pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
> requests.
>
> In PFN_MODE_PMEM (and only then), we use the altmap (don't see a way to
> configure it).

Configuration is done at pmem namespace creation time.  The metadata for
the namespace indicates where the memmap resides.  See the
ndctl-create-namespace man page:

   -M, --map=
   A pmem namespace in "fsdax" or "devdax" mode requires allocation of
   per-page metadata. The allocation can be drawn from either:

   ·   "mem": typical system memory

   ·   "dev": persistent memory reserved from the namespace

   Given relative capacities of "Persistent Memory" to "System
   RAM" the allocation defaults to reserving space out of the
   namespace directly ("--map=dev"). The overhead is 64-bytes per
   4K (16GB per 1TB) on x86.

> BUT that case is completely different from the "System RAM" mode. The memmap
> of an NVDIMM in pmem mode is barely used by core-mm (i.e., not the buddy).

Right.  (btw, I don't think system ram mode existed back then.)

> In comparison, if the buddy and everybody else works on the memmap in
> "System RAM", it's much more significant if that resides on slow memory.

Agreed.

> Looking at
>
> commit 9b6e63cbf85b89b2dbffa4955dbf2df8250e5375
> Author: Michal Hocko 
> Date:   Tue Oct 3 16:16:19 2017 -0700
>
> mm, page_alloc: add scheduling point to memmap_init_zone
>   memmap_init_zone gets a pfn range to initialize and it can be
> really
> large resulting in a soft lockup on non-preemptible kernels
> NMI watchdog: BUG: soft lockup - CPU#31 stuck for 23s!
> [kworker/u642:5:1720]
>   [...]
>   task: 88ecd7e902c0 ti: 88eca4e5 task.ti: 88eca4e5
>   RIP: move_pfn_range_to_zone+0x185/0x1d0
>   [...]
>   Call Trace:
> devm_memremap_pages+0x2c7/0x430
> pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
> nvdimm_bus_probe+0x64/0x110 [libnvdimm]
>
>
> It's hard to tell if that was only required due to t

Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem

2023-07-13 Thread Jeff Moyer
David Hildenbrand  writes:

> On 16.06.23 00:00, Vishal Verma wrote:
>> The dax/kmem driver can potentially hot-add large amounts of memory
>> originating from CXL memory expanders, or NVDIMMs, or other 'device
>> memories'. There is a chance there isn't enough regular system memory
>> available to fit the memmap for this new memory. It's therefore
>> desirable, if all other conditions are met, for the kmem managed memory
>> to place its memmap on the newly added memory itself.
>>
>> Arrange for this by first allowing for a module parameter override for
>> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
>> only other caller of this interface in drivers/acpi/acpi_memoryhotplug.c,
>> exporting the symbol so it can be called by kmem.c, and finally changing
>> the kmem driver to add_memory() in chunks of memory_block_size_bytes().
>
> 1) Why is the override a requirement here? Just let the admin
> configure it and then add conditional support for kmem.
>
> 2) I recall that there are cases where we don't want the memmap to
> land on slow memory (which online_movable would achieve). Just imagine
> the slow PMEM case. So this might need another configuration knob on
> the kmem side.

From my memory, the case where you don't want the memmap to land on
*persistent memory* is when the device is small (such as NVDIMM-N), and
you want to reserve as much space as possible for the application data.
This has nothing to do with the speed of access.

-Jeff




Re: [PATCH v2] libnvdimm/region: Allow setting align attribute on regions without mappings

2022-09-26 Thread Jeff Moyer
Tyler Hicks  writes:

> The alignment constraint for namespace creation in a region was
> increased, from 2M to 16M, for non-PowerPC architectures in v5.7 with
> commit 2522afb86a8c ("libnvdimm/region: Introduce an 'align'
> attribute"). The thought behind the change was that region alignment
> should be uniform across all architectures and, since PowerPC had the
> largest alignment constraint of 16M, all architectures should conform to
> that alignment.
>
> The change regressed namespace creation in pre-defined regions that
> relied on 2M alignment but a workaround was provided in the form of a
> sysfs attribute, named 'align', that could be adjusted to a non-default
> alignment value.
>
> However, the sysfs attribute's store function returned an error (-ENXIO)
> when userspace attempted to change the alignment of a region that had no
> mappings. This affected 2M aligned regions of volatile memory that were
> defined in a device tree using "pmem-region" and created by the
> of_pmem_region_driver, since those regions do not contain mappings
> (ndr_mappings is 0).
>
> Allow userspace to set the align attribute on pre-existing regions that
> do not have mappings so that namespaces can still be created within
> those regions, despite not being aligned to 16M.
>
> Link: https://lore.kernel.org/lkml/CA+CK2bDJ3hrWoE91L2wpAk+Yu0_=GtYw=4glddd7mxs321b...@mail.gmail.com
> Fixes: 2522afb86a8c ("libnvdimm/region: Introduce an 'align' attribute")
> Signed-off-by: Tyler Hicks 
> ---
>
> While testing with a recent kernel release (6.0-rc3), I rediscovered
> this bug and eventually realized that I never followed through with
> fixing it upstream. A year later, here's the v2 that Aneesh
> requested. Sorry about that!
>
> v2:
> - Included Aneesh's feedback to ensure the val is a power of 2 and
>   greater than PAGE_SIZE even for regions without mappings
> - Reused the max_t() trick from default_align() to avoid special
>   casing, with an if-else, when regions have mappings and when they
>   don't
>   + Didn't include Pavel's Reviewed-by since this is a slightly
> different approach than what he reviewed in v1
> - Added a Link commit tag to Pavel's initial problem description
> v1: https://lore.kernel.org/lkml/20210326152645.85225-1-tyhi...@linux.microsoft.com/
>
>  drivers/nvdimm/region_devs.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 473a71bbd9c9..550ea0bd6c53 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -509,16 +509,13 @@ static ssize_t align_store(struct device *dev,
>  {
>   struct nd_region *nd_region = to_nd_region(dev);
>   unsigned long val, dpa;
> - u32 remainder;
> + u32 mappings, remainder;
>   int rc;
>  
>   rc = kstrtoul(buf, 0, &val);
>   if (rc)
>   return rc;
>  
> - if (!nd_region->ndr_mappings)
> - return -ENXIO;
> -
>   /*
>* Ensure space-align is evenly divisible by the region
>* interleave-width because the kernel typically has no facility
> @@ -526,7 +523,8 @@ static ssize_t align_store(struct device *dev,
>* contribute to the tail capacity in system-physical-address
>* space for the namespace.
>*/
> - dpa = div_u64_rem(val, nd_region->ndr_mappings, &remainder);
> + mappings = max_t(u32, 1, nd_region->ndr_mappings);
> + dpa = div_u64_rem(val, mappings, &remainder);
>   if (!is_power_of_2(dpa) || dpa < PAGE_SIZE
>   || val > region_size(nd_region) || remainder)
>   return -EINVAL;

The math all looks okay, and this matches what's done in default_align.
Unfortunately, I don't know enough about the power architecture to
understand how you can have a region with no dimms (ndr_mappings == 0).

-Jeff




Re: [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem

2022-01-27 Thread Jeff Moyer
"Li, Zhijian"  writes:

> Copied to nvdimm list
>
> Thanks
>
> Zhijian
>
>
> on 2022/1/6 14:12, Li Zhijian wrote:
>>
>> Add Dan to the party :)
>>
>> May i know whether there is any existing APIs to check whether
>> a va/page backs to a nvdimm/pmem ?

I don't know of one.  You could try walk_system_ram_range looking for
IORES_DESC_PERSISTENT_MEMORY, but that's not very efficient.

>>> You need to get Dan to check this out, but I'm pretty sure this should
>>> be more like this:
>>>
>>> if (is_zone_device_page(page) && page->pgmap->type ==
>>> MEMORY_DEVICE_FS_DAX)

You forgot MEMORY_DEVICE_GENERIC.  However, this doesn't guarantee the
memory belongs to persistent memory, only that it is direct access
capable.

Dan, any ideas?

-Jeff




Re: [block] 52f019d43c: ndctl.test-libndctl.fail

2021-03-05 Thread Jeff Moyer
Christoph Hellwig  writes:

> Dan,
>
> can you make any sense of this report?
>
> name='nfit_test'
> path='/lib/modules/5.11.0-rc5-3-g52f019d43c22/extra/test/nfit_test.ko'
>> check_set_config_data: dimm: 0 read2 data miscompare: 0
>> check_set_config_data: dimm: 0x1 read2 data miscompare: 0
>> check_set_config_data: dimm: 0x100 read2 data miscompare: 0
>> check_set_config_data: dimm: 0x101 read2 data miscompare: 0
>> check_dax_autodetect: dax_ndns: 0x558a74d92f00 ndns: 0x558a74d92f00
>> check_dax_autodetect: dax_ndns: 0x558a74d91f40 ndns: 0x558a74d91f40
>> check_pfn_autodetect: pfn_ndns: 0x558a74d91f40 ndns: 0x558a74d91f40
>> check_pfn_autodetect: pfn_ndns: 0x558a74d8c5e0 ndns: 0x558a74d8c5e0
>> check_btt_autodetect: btt_ndns: 0x558a74d8c5e0 ndns: 0x558a74d8c5e0
>> check_btt_autodetect: btt_ndns: 0x558a74da1390 ndns: 0x558a74da1390
>> check_btt_autodetect: btt_ndns: 0x558a74d8c5e0 ndns: 0x558a74d8c5e0
>> check_btt_autodetect: btt_ndns: 0x558a74d91f40 ndns: 0x558a74d91f40
>> namespace7.0: failed to write /dev/pmem7
>> check_namespaces: namespace7.0 validate_bdev failed
>> ndctl-test1 failed: -6

This is from test/libndctl.c in the ndctl repo:

fd = open(bdevpath, O_RDONLY);
if (fd < 0) {
fprintf(stderr, "%s: failed to open(%s, O_RDONLY)\n",
devname, bdevpath);
return -ENXIO;
}
...
ro = 0;
rc = ioctl(fd, BLKROSET, &ro);
if (rc < 0) {
fprintf(stderr, "%s: BLKROSET failed\n",
devname);
rc = -errno;
goto out;
}

close(fd);
fd = open(bdevpath, O_RDWR|O_DIRECT);
...
if (write(fd, buf, 4096) < 4096) {
fprintf(stderr, "%s: failed to write %s\n",
devname, bdevpath);
rc = -ENXIO;
goto out;
}

HTH,
Jeff



Re: [PATCH] direct-io: Using kmem_cache_zalloc() instead of kmem_cache_alloc() and memset()

2021-02-25 Thread Jeff Moyer
Hi, Yang,

Yang Li  writes:

> Fix the following coccicheck warning:
> ./fs/direct-io.c:1155:7-23: WARNING: kmem_cache_zalloc should be used
> for dio, instead of kmem_cache_alloc/memset
>
> Reported-by: Abaci Robot 
> Signed-off-by: Yang Li 
> ---
>  fs/direct-io.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 0957e1b..6ec2935 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1152,7 +1152,7 @@ static inline int drop_refcount(struct dio *dio)
>   if (iov_iter_rw(iter) == READ && !count)
>   return 0;
>  
> - dio = kmem_cache_alloc(dio_cache, GFP_KERNEL);
> + dio = kmem_cache_zalloc(dio_cache, GFP_KERNEL);
>   if (!dio)
>   return -ENOMEM;
>   /*
> @@ -1160,8 +1160,6 @@ static inline int drop_refcount(struct dio *dio)
>* performance regression in a database benchmark.  So, we take
>* care to only zero out what's needed.
>*/
> - memset(dio, 0, offsetof(struct dio, pages));
> -

You must have missed the comment just above this memset:

/*
 * Believe it or not, zeroing out the page array caused a .5%
 * performance regression in a database benchmark.  So, we take
 * care to only zero out what's needed.
 */

That's referring to this part of the dio struct:

/*
 * pages[] (and any fields placed after it) are not zeroed out at
 * allocation time.  Don't add new fields after pages[] unless you
 * wish that they not be zeroed.
 */
union {
struct page *pages[DIO_PAGES];  /* page buffer */
struct work_struct complete_work;/* deferred AIO completion */
};
} cacheline_aligned_in_smp;

Nacked-by: Jeff Moyer 



Re: [PATCH] ACPI: NFIT: Fix input validation of bus-family

2020-11-24 Thread Jeff Moyer
Dan Williams  writes:

> Dan reports that smatch thinks userspace can craft an out-of-bound bus
> family number. However, nd_cmd_clear_to_send() blocks all non-zero
> values of bus-family since only the kernel can initiate these commands.
> However, in the speculation path, family is a user controlled array
> index value so mask it for speculation safety. Also, since the
> nd_cmd_clear_to_send() safety is non-obvious and may change in
> the future, include input validation as if userspace could get past the
> nd_cmd_clear_to_send() gatekeeper.
>
> Link: http://lore.kernel.org/r/2020113000.GA1237157@mwanda
> Reported-by: Dan Carpenter 
> Fixes: 6450ddbd5d8e ("ACPI: NFIT: Define runtime firmware activation commands")
> Cc: 
> Signed-off-by: Dan Williams 
> ---
>  drivers/acpi/nfit/core.c |6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index cda7b6c52504..b11b08a60684 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -479,8 +480,11 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm,
>   cmd_mask = nd_desc->cmd_mask;
>   if (cmd == ND_CMD_CALL && call_pkg->nd_family) {
>   family = call_pkg->nd_family;
> - if (!test_bit(family, &nd_desc->bus_family_mask))
> + if (family > NVDIMM_BUS_FAMILY_MAX ||
> + !test_bit(family, &nd_desc->bus_family_mask))
>   return -EINVAL;
> + family = array_index_nospec(family,
> + NVDIMM_BUS_FAMILY_MAX + 1);
>   dsm_mask = acpi_desc->family_dsm_mask[family];
>   guid = to_nfit_bus_uuid(family);
>   } else {

Reviewed-by: Jeff Moyer 



Re: [PATCH] fs/dax: Fix pmd vs pte conflict detection

2019-10-21 Thread Jeff Moyer
Dan Williams  writes:

> Check for NULL entries before checking the entry order, otherwise NULL
> is misinterpreted as a present pte conflict. The 'order' check needs to
> happen before the locked check as an unlocked entry at the wrong order
> must fallback to lookup the correct order.

Please include the user-visible effects of the problem in the changelog.

Thanks,
Jeff

>
> Reported-by: Jeff Smits 
> Reported-by: Doug Nelson 
> Cc: 
> Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults")
> Cc: Jan Kara 
> Cc: Matthew Wilcox (Oracle) 
> Signed-off-by: Dan Williams 
> ---
>  fs/dax.c |5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index a71881e77204..08160011d94c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -221,10 +221,11 @@ static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
>  
>   for (;;) {
>   entry = xas_find_conflict(xas);
> + if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
> + return entry;
>   if (dax_entry_order(entry) < order)
>   return XA_RETRY_ENTRY;
> - if (!entry || WARN_ON_ONCE(!xa_is_value(entry)) ||
> - !dax_is_locked(entry))
> + if (!dax_is_locked(entry))
>   return entry;
>  
>   wq = dax_entry_waitqueue(xas, entry, &ewq.key);



Re: [PATCH 00/13] nvdimm: Use more common kernel coding style

2019-09-12 Thread Jeff Moyer
Joe Perches  writes:

> Rather than have a local coding style, use the typical kernel style.

The coding style isn't that different from the core kernel, and it's
still quite readable.  I'd rather avoid the churn and the risk of
introducing regressions.  This will also make backports to stable more
of a pain, so it isn't without cost.  Dan, is this really something you
want to do?

-Jeff

>
> Joe Perches (13):
>   nvdimm: Use more typical whitespace
>   nvdimm: Move logical continuations to previous line
>   nvdimm: Use octal permissions
>   nvdimm: Use a more common kernel spacing style
>   nvdimm: Use "unsigned int" in preference to "unsigned"
>   nvdimm: Add and remove blank lines
>   nvdimm: Use typical kernel brace styles
>   nvdimm: Use typical kernel style indentation
>   nvdimm: btt.h: Neaten #defines to improve readability
>   nvdimm: namespace_devs: Move assignment operators
>   nvdimm: Use more common logic testing styles and bare ; positions
>   nvdimm: namespace_devs: Change progess typo to progress
>   nvdimm: Miscellaneous neatening
>
>  drivers/nvdimm/badrange.c   |  22 +-
>  drivers/nvdimm/blk.c|  39 ++--
>  drivers/nvdimm/btt.c| 249 +++--
>  drivers/nvdimm/btt.h|  56 ++---
>  drivers/nvdimm/btt_devs.c   |  68 +++---
>  drivers/nvdimm/bus.c| 138 ++--
>  drivers/nvdimm/claim.c  |  50 ++---
>  drivers/nvdimm/core.c   |  42 ++--
>  drivers/nvdimm/dax_devs.c   |   3 +-
>  drivers/nvdimm/dimm.c   |   3 +-
>  drivers/nvdimm/dimm_devs.c  | 107 -
>  drivers/nvdimm/e820.c   |   2 +-
>  drivers/nvdimm/label.c  | 213 +-
>  drivers/nvdimm/label.h  |   6 +-
>  drivers/nvdimm/namespace_devs.c | 472 +---
>  drivers/nvdimm/nd-core.h|  31 +--
>  drivers/nvdimm/nd.h |  94 
>  drivers/nvdimm/nd_virtio.c  |  20 +-
>  drivers/nvdimm/of_pmem.c|   6 +-
>  drivers/nvdimm/pfn_devs.c   | 136 ++--
>  drivers/nvdimm/pmem.c   |  57 ++---
>  drivers/nvdimm/pmem.h   |   2 +-
>  drivers/nvdimm/region.c |  20 +-
>  drivers/nvdimm/region_devs.c| 160 +++---
>  drivers/nvdimm/security.c   | 138 ++--
>  drivers/nvdimm/virtio_pmem.c|  10 +-
>  26 files changed, 1115 insertions(+), 1029 deletions(-)


Re: [PATCH v2 0/3] libnvdimm/security: Enumerate the frozen state and other cleanups

2019-08-28 Thread Jeff Moyer
Dan Williams  writes:

> Changes since v1 [1]:
> - Cleanup patch1, simplify flags return in the overwrite case and
>   consolidate frozen-state cases (Jeff)
> - Clarify the motivation for patch2 (Jeff)
> - Collect Dave's Reviewed-by

The series tests out fine for me.

Thanks, Dan!

-Jeff


>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2019-August/023133.html
>
> ---
>
> Jeff reported a scenario where ndctl was failing to unlock DIMMs [2].
> Through the course of debug it was discovered that the security
> interface on the DIMMs was in the 'frozen' state disallowing unlock, or
> any security operation.  Unfortunately the kernel only showed that the
> DIMMs were 'locked', not 'locked' and 'frozen'.
>
> Introduce a new sysfs 'frozen' attribute so that ndctl can reflect the
> "security-operations-allowed" state independently of the lock status.
> Then, followup with cleanups related to replacing a security-state-enum
> with a set of flags.
>  
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2019-August/022856.html
>
> ---
>
> Dan Williams (3):
>   libnvdimm/security: Introduce a 'frozen' attribute
>   libnvdimm/security: Tighten scope of nvdimm->busy vs security operations
>   libnvdimm/security: Consolidate 'security' operations
>
>
>  drivers/acpi/nfit/intel.c|   59 ++-
>  drivers/nvdimm/bus.c |2 
>  drivers/nvdimm/dimm_devs.c   |  134 ++
>  drivers/nvdimm/nd-core.h |   51 --
>  drivers/nvdimm/security.c|  199 +-
>  include/linux/libnvdimm.h|9 +-
>  tools/testing/nvdimm/dimm_devs.c |   19 +---
>  7 files changed, 226 insertions(+), 247 deletions(-)


Re: [PATCH v2 2/3] libnvdimm/security: Tighten scope of nvdimm->busy vs security operations

2019-08-28 Thread Jeff Moyer
Dan Williams  writes:

> An attempt to freeze DIMMs currently runs afoul of default blocking of
> all security operations in the entry to the 'store' routine for the
> 'security' sysfs attribute.
>
> The blanket blocking of all security operations while the DIMM is in
> active use in a region is too restrictive. The only security operations
> that need to be aware of the ->busy state are those that mutate the
> state of data, i.e. erase and overwrite.
>
> Refactor the ->busy checks to be applied at the entry common entry point
> in __security_store() rather than each of the helper routines to enable
> freeze to be run regardless of busy state.
>
> Reviewed-by: Dave Jiang 
> Signed-off-by: Dan Williams 

Reviewed-by: Jeff Moyer 

> ---
>  drivers/nvdimm/dimm_devs.c |   33 -
>  drivers/nvdimm/security.c  |   10 --
>  2 files changed, 16 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> index 53330625fe07..d837cb9be83d 100644
> --- a/drivers/nvdimm/dimm_devs.c
> +++ b/drivers/nvdimm/dimm_devs.c
> @@ -424,9 +424,6 @@ static ssize_t __security_store(struct device *dev, const char *buf, size_t len)
>   unsigned int key, newkey;
>   int i;
>  
> - if (atomic_read(&nvdimm->busy))
> - return -EBUSY;
> -
>   rc = sscanf(buf, "%"__stringify(SEC_CMD_SIZE)"s"
>   " %"__stringify(KEY_ID_SIZE)"s"
>   " %"__stringify(KEY_ID_SIZE)"s",
> @@ -451,23 +448,25 @@ static ssize_t __security_store(struct device *dev, const char *buf, size_t len)
>   } else if (i == OP_DISABLE) {
>   dev_dbg(dev, "disable %u\n", key);
>   rc = nvdimm_security_disable(nvdimm, key);
> - } else if (i == OP_UPDATE) {
> - dev_dbg(dev, "update %u %u\n", key, newkey);
> - rc = nvdimm_security_update(nvdimm, key, newkey, NVDIMM_USER);
> - } else if (i == OP_ERASE) {
> - dev_dbg(dev, "erase %u\n", key);
> - rc = nvdimm_security_erase(nvdimm, key, NVDIMM_USER);
> + } else if (i == OP_UPDATE || i == OP_MASTER_UPDATE) {
> + dev_dbg(dev, "%s %u %u\n", ops[i].name, key, newkey);
> + rc = nvdimm_security_update(nvdimm, key, newkey, i == OP_UPDATE
> + ? NVDIMM_USER : NVDIMM_MASTER);
> + } else if (i == OP_ERASE || i == OP_MASTER_ERASE) {
> + dev_dbg(dev, "%s %u\n", ops[i].name, key);
> + if (atomic_read(&nvdimm->busy)) {
> + dev_dbg(dev, "Unable to secure erase while DIMM active.\n");
> + return -EBUSY;
> + }
> + rc = nvdimm_security_erase(nvdimm, key, i == OP_ERASE
> + ? NVDIMM_USER : NVDIMM_MASTER);
>   } else if (i == OP_OVERWRITE) {
>   dev_dbg(dev, "overwrite %u\n", key);
> + if (atomic_read(&nvdimm->busy)) {
> + dev_dbg(dev, "Unable to overwrite while DIMM active.\n");
> + return -EBUSY;
> + }
>   rc = nvdimm_security_overwrite(nvdimm, key);
> - } else if (i == OP_MASTER_UPDATE) {
> - dev_dbg(dev, "master_update %u %u\n", key, newkey);
> - rc = nvdimm_security_update(nvdimm, key, newkey,
> - NVDIMM_MASTER);
> - } else if (i == OP_MASTER_ERASE) {
> - dev_dbg(dev, "master_erase %u\n", key);
> - rc = nvdimm_security_erase(nvdimm, key,
> - NVDIMM_MASTER);
>   } else
>   return -EINVAL;
>  
> diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
> index 5862d0eee9db..2166e627383a 100644
> --- a/drivers/nvdimm/security.c
> +++ b/drivers/nvdimm/security.c
> @@ -334,11 +334,6 @@ int nvdimm_security_erase(struct nvdimm *nvdimm, unsigned int keyid,
>   || !nvdimm->sec.flags)
>   return -EOPNOTSUPP;
>  
> - if (atomic_read(&nvdimm->busy)) {
> - dev_dbg(dev, "Unable to secure erase while DIMM active.\n");
> - return -EBUSY;
> - }
> -
>   rc = check_security_state(nvdimm);
>   if (rc)
>   return rc;
> @@ -380,11 +375,6 @@ int nvdimm_security_overwrite(struct nvdimm *nvdimm, unsigned int keyid)
>   || !nvdimm->sec.flags)
>   return -EOPNOTSUPP;
>  
> - if (atomic_read(&nvdimm->busy)) {
> - dev_dbg(dev, "Unable to overwrite while DIMM active.\n");
> - return -EBUSY;
> - }
> -
>   if (dev->driver == NULL) {
>   dev_dbg(dev, "Unable to overwrite while DIMM active.\n");
>   return -EINVAL;
>
> ___
> Linux-nvdimm mailing list
> linux-nvd...@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v2 1/3] libnvdimm/security: Introduce a 'frozen' attribute

2019-08-28 Thread Jeff Moyer
Dan Williams  writes:

> In the process of debugging a system with an NVDIMM that was failing to
> unlock it was found that the kernel is reporting 'locked' while the DIMM
> security interface is 'frozen'. Unfortunately the security state is
> tracked internally as an enum which prevents it from communicating the
> difference between 'locked' and 'locked + frozen'. It follows that the
> enum also prevents the kernel from communicating 'unlocked + frozen'
> which would be useful for debugging why security operations like 'change
> passphrase' are disabled.
>
> Ditch the security state enum for a set of flags and introduce a new
> sysfs attribute explicitly for the 'frozen' state. The regression risk
> is low because the 'frozen' state was already blocked behind the
> 'locked' state, but will need to revisit if there were cases where
> applications need 'frozen' to show up in the primary 'security'
> attribute. The expectation is that communicating 'frozen' is mostly a
> helper for debug and status monitoring.
>
> Reviewed-by: Dave Jiang 
> Reported-by: Jeff Moyer 
> Signed-off-by: Dan Williams 

Reviewed-by: Jeff Moyer 

> ---
>  drivers/acpi/nfit/intel.c|   59 ---
>  drivers/nvdimm/bus.c |2 -
>  drivers/nvdimm/dimm_devs.c   |   59 +--
>  drivers/nvdimm/nd-core.h |   21 ++--
>  drivers/nvdimm/security.c|   99 ++
>  include/linux/libnvdimm.h|9 ++-
>  tools/testing/nvdimm/dimm_devs.c |   19 ++-
>  7 files changed, 141 insertions(+), 127 deletions(-)
>
> diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
> index cddd0fcf622c..1113b679cd7b 100644
> --- a/drivers/acpi/nfit/intel.c
> +++ b/drivers/acpi/nfit/intel.c
> @@ -7,10 +7,11 @@
>  #include "intel.h"
>  #include "nfit.h"
>  
> -static enum nvdimm_security_state intel_security_state(struct nvdimm *nvdimm,
> +static unsigned long intel_security_flags(struct nvdimm *nvdimm,
>   enum nvdimm_passphrase_type ptype)
>  {
>   struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
> + unsigned long security_flags = 0;
>   struct {
>   struct nd_cmd_pkg pkg;
>   struct nd_intel_get_security_state cmd;
> @@ -27,7 +28,7 @@ static enum nvdimm_security_state intel_security_state(struct nvdimm *nvdimm,
>   int rc;
>  
> - if (!test_bit(NVDIMM_INTEL_GET_SECURITY_STATE, &nfit_mem->dsm_mask))
> - return -ENXIO;
> + return 0;
>  
>   /*
>* Short circuit the state retrieval while we are doing overwrite.
> @@ -35,38 +36,42 @@ static enum nvdimm_security_state intel_security_state(struct nvdimm *nvdimm,
>* until the overwrite DSM completes.
>*/
>   if (nvdimm_in_overwrite(nvdimm) && ptype == NVDIMM_USER)
> - return NVDIMM_SECURITY_OVERWRITE;
> + return BIT(NVDIMM_SECURITY_OVERWRITE);
>  
>   rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
> - if (rc < 0)
> - return rc;
> - if (nd_cmd.cmd.status)
> - return -EIO;
> + if (rc < 0 || nd_cmd.cmd.status) {
> + pr_err("%s: security state retrieval failed (%d:%#x)\n",
> + nvdimm_name(nvdimm), rc, nd_cmd.cmd.status);
> + return 0;
> + }
>  
>   /* check and see if security is enabled and locked */
>   if (ptype == NVDIMM_MASTER) {
>   if (nd_cmd.cmd.extended_state & ND_INTEL_SEC_ESTATE_ENABLED)
> - return NVDIMM_SECURITY_UNLOCKED;
> - else if (nd_cmd.cmd.extended_state &
> - ND_INTEL_SEC_ESTATE_PLIMIT)
> - return NVDIMM_SECURITY_FROZEN;
> - } else {
> - if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_UNSUPPORTED)
> - return -ENXIO;
> - else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_ENABLED) {
> - if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_LOCKED)
> - return NVDIMM_SECURITY_LOCKED;
> - else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_FROZEN
> - || nd_cmd.cmd.state &
> - ND_INTEL_SEC_STATE_PLIMIT)
> - return NVDIMM_SECURITY_FROZEN;
> - else
> - return NVDIMM_SECURITY_UNLOCKED;
> - }
> + set_bit(NVDIMM_SECURITY_UNLOCKED, &security_flags);
> + else
> + 

Re: [PATCH 2/2] drivers/dax/kmem: give a warning if CONFIG_DEV_DAX_PMEM_COMPAT is enabled

2019-08-22 Thread Jeff Moyer
Jia He  writes:

> commit c221c0b0308f ("device-dax: "Hotplug" persistent memory for use
> like normal RAM") helps to add persistent memory as normal RAM blocks.
> But this driver doesn't work if CONFIG_DEV_DAX_PMEM_COMPAT is enabled.
>
> Here is the debugging call trace when CONFIG_DEV_DAX_PMEM_COMPAT is
> enabled.
> [4.443730]  devm_memremap_pages+0x4b9/0x540
> [4.443733]  dev_dax_probe+0x112/0x220 [device_dax]
> [4.443735]  dax_pmem_compat_probe+0x58/0x92 [dax_pmem_compat]
> [4.443737]  nvdimm_bus_probe+0x6b/0x150
> [4.443739]  really_probe+0xf5/0x3d0
> [4.443740]  driver_probe_device+0x11b/0x130
> [4.443741]  device_driver_attach+0x58/0x60
> [4.443742]  __driver_attach+0xa3/0x140
>
> Then the dax0.0 device will be registered as "nd" bus instead of
> "dax" bus. This causes the error as follows:
> root@ubuntu:~# echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> -bash: echo: write error: No such device
>
> This gives a warning to notify the user.
>
> Signed-off-by: Jia He 
> ---
>  drivers/dax/kmem.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index ad62d551d94e..b77f0e880598 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -93,6 +93,11 @@ static struct dax_device_driver device_dax_kmem_driver = {
>  
>  static int __init dax_kmem_init(void)
>  {
> + if (IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)) {
> + pr_warn("CONFIG_DEV_DAX_PMEM_COMPAT is not compatible\n");
> + pr_warn("kmem dax driver might not be workable\n");
> + }
> +
>   return dax_driver_register(&device_dax_kmem_driver);
>  }

This logic is wrong (and the error message is *very* confusing).  You
can have the driver configured, but not loaded.  In that case, the kmem
driver will load and work properly.

When using daxctl to online memory, you already get the following
message:

libdaxctl: daxctl_dev_disable: dax0.0: error: device model is dax-class

That's still not very helpful.  It would be better if the message
suggested a fix (like using migrate-device-model).  Vishal?

Cheers,
Jeff


Re: [PATCH] aio: Fix io_pgetevents() struct __compat_aio_sigset layout

2019-08-22 Thread Jeff Moyer
Guillem Jover  writes:

> This type is used to pass the sigset_t from userland to the kernel,
> but it was using the kernel native pointer type for the member
> representing the compat userland pointer to the userland sigset_t.
>
> This messes up the layout, and makes the kernel eat up both the
> userland pointer and the size members into the kernel pointer, and
> then reads garbage into the kernel sigsetsize. Which makes the sigset_t
> size consistency check fail, and consequently the syscall always
> returns -EINVAL.
>
> This breaks both libaio and strace on 32-bit userland running on 64-bit
> kernels. And there are apparently no users in the wild of the current
> broken layout (at least according to codesearch.debian.org and a brief
> check over github.com search). So it looks safe to fix this directly
> in the kernel, instead of either letting userland deal with this
> permanently with the additional overhead or trying to make the syscall
> infer what layout userland used, even though this is also being worked
> around in libaio to temporarily cope with kernels that have not yet
> been fixed.
>
> We use a proper compat_uptr_t instead of a compat_sigset_t pointer.
>
> Fixes: 7a074e96 ("aio: implement io_pgetevents")
> Signed-off-by: Guillem Jover 

Looks good, thanks for finding and fixing this!

Reviewed-by: Jeff Moyer 

> ---
>  fs/aio.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 01e0fb9ae45a..056f291bc66f 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -2179,7 +2179,7 @@ SYSCALL_DEFINE5(io_getevents_time32, __u32, ctx_id,
>  #ifdef CONFIG_COMPAT
>  
>  struct __compat_aio_sigset {
> - compat_sigset_t __user  *sigmask;
> + compat_uptr_t   sigmask;
>   compat_size_t   sigsetsize;
>  };
>  
> @@ -2204,7 +2204,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents,
>   if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
>   return -EFAULT;
>  
> - ret = set_compat_user_sigmask(ksig.sigmask, ksig.sigsetsize);
> + ret = set_compat_user_sigmask(compat_ptr(ksig.sigmask), 
> ksig.sigsetsize);
>   if (ret)
>   return ret;
>  
> @@ -2239,7 +2239,7 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64,
>   if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
>   return -EFAULT;
>  
> - ret = set_compat_user_sigmask(ksig.sigmask, ksig.sigsetsize);
> + ret = set_compat_user_sigmask(compat_ptr(ksig.sigmask), 
> ksig.sigsetsize);
>   if (ret)
>   return ret;


Re: [PATCH] aio: remove redundant assignment to variable ret

2019-08-22 Thread Jeff Moyer
Colin King  writes:

> From: Colin Ian King 
>
> The variable ret is being set to -EINVAL however this is never read
> and later it is being reassigned to a new value. The assignment is
> redundant and hence can be removed.
>
> Addresses-Coverity: ("Unused Value")
> Signed-off-by: Colin Ian King 
> ---
>  fs/aio.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index f9f441b59966..3e290dfac10a 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1528,7 +1528,6 @@ static int aio_read(struct kiocb *req, const struct 
> iocb *iocb,
>   file = req->ki_filp;
>   if (unlikely(!(file->f_mode & FMODE_READ)))
>   return -EBADF;
> - ret = -EINVAL;
>       if (unlikely(!file->f_op->read_iter))
>   return -EINVAL;

Acked-by: Jeff Moyer 


Re: [PATCH 2/3] libnvdimm/security: Tighten scope of nvdimm->busy vs security operations

2019-08-19 Thread Jeff Moyer
Dan Williams  writes:

> On Fri, Aug 16, 2019 at 1:49 PM Jeff Moyer  wrote:
>>
>> Dan Williams  writes:
>>
>> > The blanket blocking of all security operations while the DIMM is in
>> > active use in a region is too restrictive. The only security operations
>> > that need to be aware of the ->busy state are those that mutate the
>> > state of data, i.e. erase and overwrite.
>> >
>> > Refactor the ->busy checks to be applied at the common entry point
>> > in __security_store() rather than each of the helper routines.
>>
>> I'm not sure this buys you much.  Did you test this on actual hardware
>> to make sure your assumptions are correct?  I guess the worst case is we
>> get an "invalid security state" error back from the firmware.
>>
>> Still, what's the motivation for this?
>
> The motivation was when I went to test setting the frozen state and
> found that it complained about the DIMM being active. There's nothing
> wrong with freezing security while the DIMM is mapped. ...but then I
> somehow managed to write this generalized commit message that left out
> the explicit failure I was worried about. Yes, moved too fast, but the
> motivation is "allow freeze while active" and centralize the ->busy
> check so it's just one function to review that common constraint.

OK, thanks for the info.

Reviewed-by: Jeff Moyer 


Re: [PATCH 3/3] libnvdimm/security: Consolidate 'security' operations

2019-08-16 Thread Jeff Moyer
Dan Williams  writes:

> The security operations are exported from libnvdimm/security.c to
> libnvdimm/dimm_devs.c, and libnvdimm/security.c is optionally compiled
> based on the CONFIG_NVDIMM_KEYS config symbol.
>
> Rather than export the operations across compile objects, just move the
> __security_store() entry point to live with the helpers.
>
> Cc: Dave Jiang 
> Signed-off-by: Dan Williams 

Fine by me.

Acked-by: Jeff Moyer 



Re: [PATCH 2/3] libnvdimm/security: Tighten scope of nvdimm->busy vs security operations

2019-08-16 Thread Jeff Moyer
Dan Williams  writes:

> The blanket blocking of all security operations while the DIMM is in
> active use in a region is too restrictive. The only security operations
> that need to be aware of the ->busy state are those that mutate the
> state of data, i.e. erase and overwrite.
>
> Refactor the ->busy checks to be applied at the common entry point
> in __security_store() rather than each of the helper routines.

I'm not sure this buys you much.  Did you test this on actual hardware
to make sure your assumptions are correct?  I guess the worst case is we
get an "invalid security state" error back from the firmware.

Still, what's the motivation for this?

-Jeff

>
> Cc: Dave Jiang 
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/dimm_devs.c |   33 -
>  drivers/nvdimm/security.c  |   10 --
>  2 files changed, 16 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> index 53330625fe07..d837cb9be83d 100644
> --- a/drivers/nvdimm/dimm_devs.c
> +++ b/drivers/nvdimm/dimm_devs.c
> @@ -424,9 +424,6 @@ static ssize_t __security_store(struct device *dev, const 
> char *buf, size_t len)
>   unsigned int key, newkey;
>   int i;
>  
> - if (atomic_read(&nvdimm->busy))
> - return -EBUSY;
> -
>   rc = sscanf(buf, "%"__stringify(SEC_CMD_SIZE)"s"
>   " %"__stringify(KEY_ID_SIZE)"s"
>   " %"__stringify(KEY_ID_SIZE)"s",
> @@ -451,23 +448,25 @@ static ssize_t __security_store(struct device *dev, 
> const char *buf, size_t len)
>   } else if (i == OP_DISABLE) {
>   dev_dbg(dev, "disable %u\n", key);
>   rc = nvdimm_security_disable(nvdimm, key);
> - } else if (i == OP_UPDATE) {
> - dev_dbg(dev, "update %u %u\n", key, newkey);
> - rc = nvdimm_security_update(nvdimm, key, newkey, NVDIMM_USER);
> - } else if (i == OP_ERASE) {
> - dev_dbg(dev, "erase %u\n", key);
> - rc = nvdimm_security_erase(nvdimm, key, NVDIMM_USER);
> + } else if (i == OP_UPDATE || i == OP_MASTER_UPDATE) {
> + dev_dbg(dev, "%s %u %u\n", ops[i].name, key, newkey);
> + rc = nvdimm_security_update(nvdimm, key, newkey, i == OP_UPDATE
> + ? NVDIMM_USER : NVDIMM_MASTER);
> + } else if (i == OP_ERASE || i == OP_MASTER_ERASE) {
> + dev_dbg(dev, "%s %u\n", ops[i].name, key);
> + if (atomic_read(&nvdimm->busy)) {
> + dev_dbg(dev, "Unable to secure erase while DIMM 
> active.\n");
> + return -EBUSY;
> + }
> + rc = nvdimm_security_erase(nvdimm, key, i == OP_ERASE
> + ? NVDIMM_USER : NVDIMM_MASTER);
>   } else if (i == OP_OVERWRITE) {
>   dev_dbg(dev, "overwrite %u\n", key);
> + if (atomic_read(&nvdimm->busy)) {
> + dev_dbg(dev, "Unable to overwrite while DIMM 
> active.\n");
> + return -EBUSY;
> + }
>   rc = nvdimm_security_overwrite(nvdimm, key);
> - } else if (i == OP_MASTER_UPDATE) {
> - dev_dbg(dev, "master_update %u %u\n", key, newkey);
> - rc = nvdimm_security_update(nvdimm, key, newkey,
> - NVDIMM_MASTER);
> - } else if (i == OP_MASTER_ERASE) {
> - dev_dbg(dev, "master_erase %u\n", key);
> - rc = nvdimm_security_erase(nvdimm, key,
> - NVDIMM_MASTER);
>   } else
>   return -EINVAL;
>  
> diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
> index 5862d0eee9db..2166e627383a 100644
> --- a/drivers/nvdimm/security.c
> +++ b/drivers/nvdimm/security.c
> @@ -334,11 +334,6 @@ int nvdimm_security_erase(struct nvdimm *nvdimm, 
> unsigned int keyid,
>   || !nvdimm->sec.flags)
>   return -EOPNOTSUPP;
>  
> - if (atomic_read(&nvdimm->busy)) {
> - dev_dbg(dev, "Unable to secure erase while DIMM active.\n");
> - return -EBUSY;
> - }
> -
>   rc = check_security_state(nvdimm);
>   if (rc)
>   return rc;
> @@ -380,11 +375,6 @@ int nvdimm_security_overwrite(struct nvdimm *nvdimm, 
> unsigned int keyid)
>   || !nvdimm->sec.flags)
>   return -EOPNOTSUPP;
>  
> - if (atomic_read(&nvdimm->busy)) {
> - dev_dbg(dev, "Unable to overwrite while DIMM active.\n");
> - return -EBUSY;
> - }
> -
>   if (dev->driver == NULL) {
>   dev_dbg(dev, "Unable to overwrite while DIMM active.\n");
>   return -EINVAL;
>
> ___
> Linux-nvdimm mailing list
> linux-nvd...@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 1/3] libnvdimm/security: Introduce a 'frozen' attribute

2019-08-16 Thread Jeff Moyer
Dan Williams  writes:

> In the process of debugging a system with an NVDIMM that was failing to
> unlock it was found that the kernel is reporting 'locked' while the DIMM
> security interface is 'frozen'. Unfortunately the security state is
> tracked internally as an enum which prevents it from communicating the
> difference between 'locked' and 'locked + frozen'. It follows that the
> enum also prevents the kernel from communicating 'unlocked + frozen'
> which would be useful for debugging why security operations like 'change
> passphrase' are disabled.
>
> Ditch the security state enum for a set of flags and introduce a new
> sysfs attribute explicitly for the 'frozen' state. The regression risk
> is low because the 'frozen' state was already blocked behind the
> 'locked' state, but will need to revisit if there were cases where
> applications need 'frozen' to show up in the primary 'security'
> attribute. The expectation is that communicating 'frozen' is mostly a
> helper for debug and status monitoring.
>
> Cc: Dave Jiang 
> Reported-by: Jeff Moyer 
> Signed-off-by: Dan Williams 
> ---
>  drivers/acpi/nfit/intel.c|   65 ++---
>  drivers/nvdimm/bus.c |2 -
>  drivers/nvdimm/dimm_devs.c   |   59 +--
>  drivers/nvdimm/nd-core.h |   21 ++--
>  drivers/nvdimm/security.c|   99 
> ++
>  include/linux/libnvdimm.h|9 ++-
>  tools/testing/nvdimm/dimm_devs.c |   19 ++-
>  7 files changed, 146 insertions(+), 128 deletions(-)
>
> diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
> index cddd0fcf622c..2c51ca4155dc 100644
> --- a/drivers/acpi/nfit/intel.c
> +++ b/drivers/acpi/nfit/intel.c
> @@ -7,10 +7,11 @@
>  #include "intel.h"
>  #include "nfit.h"
>  
> -static enum nvdimm_security_state intel_security_state(struct nvdimm *nvdimm,
> +static unsigned long intel_security_flags(struct nvdimm *nvdimm,
>   enum nvdimm_passphrase_type ptype)
>  {
>   struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
> + unsigned long security_flags = 0;
>   struct {
>   struct nd_cmd_pkg pkg;
>   struct nd_intel_get_security_state cmd;
> @@ -27,46 +28,54 @@ static enum nvdimm_security_state 
> intel_security_state(struct nvdimm *nvdimm,
>   int rc;
>  
>   if (!test_bit(NVDIMM_INTEL_GET_SECURITY_STATE, &nfit_mem->dsm_mask))
> - return -ENXIO;
> + return 0;
>  
>   /*
>* Short circuit the state retrieval while we are doing overwrite.
>* The DSM spec states that the security state is indeterminate
>* until the overwrite DSM completes.
>*/
> - if (nvdimm_in_overwrite(nvdimm) && ptype == NVDIMM_USER)
> - return NVDIMM_SECURITY_OVERWRITE;
> + if (nvdimm_in_overwrite(nvdimm) && ptype == NVDIMM_USER) {
> + set_bit(NVDIMM_SECURITY_OVERWRITE, &security_flags);
> + return security_flags;
> + }

Why not just
return BIT(NVDIMM_SECURITY_OVERWRITE);
?


>   rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), NULL);
> - if (rc < 0)
> - return rc;
> - if (nd_cmd.cmd.status)
> - return -EIO;
> + if (rc < 0 || nd_cmd.cmd.status) {
> + pr_err("%s: security state retrieval failed (%d:%#x)\n",
> + nvdimm_name(nvdimm), rc, nd_cmd.cmd.status);
> + return 0;
> + }
>  
>   /* check and see if security is enabled and locked */
>   if (ptype == NVDIMM_MASTER) {
>   if (nd_cmd.cmd.extended_state & ND_INTEL_SEC_ESTATE_ENABLED)
> - return NVDIMM_SECURITY_UNLOCKED;
> - else if (nd_cmd.cmd.extended_state &
> - ND_INTEL_SEC_ESTATE_PLIMIT)
> - return NVDIMM_SECURITY_FROZEN;
> - } else {
> - if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_UNSUPPORTED)
> - return -ENXIO;
> - else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_ENABLED) {
> - if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_LOCKED)
> - return NVDIMM_SECURITY_LOCKED;
> - else if (nd_cmd.cmd.state & ND_INTEL_SEC_STATE_FROZEN
> - || nd_cmd.cmd.state &
> - ND_INTEL_SEC_STATE_PLIMIT)
> - return NVDIMM_SECURITY_FROZEN;
> - else
> - return NVDIMM_SECURITY_UNLOCKED;
> - }

Re: [PATCH] aio: add timeout validity check for io_[p]getevents

2019-07-29 Thread Jeff Moyer
Al Viro  writes:

> On Mon, Jul 29, 2019 at 10:57:41AM -0400, Jeff Moyer wrote:
>> Al, can you take this through your tree?
>
> Umm...  Can do, but I had an impression that Arnd and Deepa
> had a tree for timespec-related work.  OTOH, it had been
> relatively quiet last cycle, so...  If they have nothing
> in the area, I can take it through vfs.git.

Hmm, okay.  Yi, can you repost the patch, adding my Reviewed-by tag, and
CC-ing Arnd and Deepa:

Arnd Bergmann 
Deepa Dinamani 

Thanks!
Jeff


Re: [PATCH] aio: add timeout validity check for io_[p]getevents

2019-07-29 Thread Jeff Moyer
Al, can you take this through your tree?

Thanks,
Jeff

Jeff Moyer  writes:

> "zhangyi (F)"  writes:
>
>> io_[p]getevents syscall should return -EINVAL if the timeout is out of
>> range, add this validity check.
>>
>> Signed-off-by: zhangyi (F) 
>> ---
>>  fs/aio.c | 11 +--
>>  1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/aio.c b/fs/aio.c
>> index 01e0fb9..dd967a0 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -2031,10 +2031,17 @@ static long do_io_getevents(aio_context_t ctx_id,
>>  struct io_event __user *events,
>>  struct timespec64 *ts)
>>  {
>> -ktime_t until = ts ? timespec64_to_ktime(*ts) : KTIME_MAX;
>> -struct kioctx *ioctx = lookup_ioctx(ctx_id);
>> +ktime_t until = KTIME_MAX;
>> +struct kioctx *ioctx = NULL;
>>  long ret = -EINVAL;
>>  
>> +if (ts) {
>> +if (!timespec64_valid(ts))
>> +return ret;
>> +until = timespec64_to_ktime(*ts);
>> +}
>> +
>> +ioctx = lookup_ioctx(ctx_id);
>>  if (likely(ioctx)) {
>>  if (likely(min_nr <= nr && min_nr >= 0))
>>  ret = read_events(ioctx, min_nr, nr, events, until);
>
> Reviewed-by: Jeff Moyer 
>
> The previous suggestion[1] of fixing the helpers never materialized, so
> let's just get this fixed, already.
>
> -Jeff
>
> [1] https://marc.info/?l=linux-fsdevel&m=152209450618587&w=2
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majord...@kvack.org.  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: a...@kvack.org


Re: [PATCH] aio: add timeout validity check for io_[p]getevents

2019-07-25 Thread Jeff Moyer
"zhangyi (F)"  writes:

> io_[p]getevents syscall should return -EINVAL if the timeout is out of
> range, add this validity check.
>
> Signed-off-by: zhangyi (F) 
> ---
>  fs/aio.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 01e0fb9..dd967a0 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -2031,10 +2031,17 @@ static long do_io_getevents(aio_context_t ctx_id,
>   struct io_event __user *events,
>   struct timespec64 *ts)
>  {
> - ktime_t until = ts ? timespec64_to_ktime(*ts) : KTIME_MAX;
> - struct kioctx *ioctx = lookup_ioctx(ctx_id);
> + ktime_t until = KTIME_MAX;
> + struct kioctx *ioctx = NULL;
>   long ret = -EINVAL;
>  
> + if (ts) {
> + if (!timespec64_valid(ts))
> + return ret;
> + until = timespec64_to_ktime(*ts);
> + }
> +
> + ioctx = lookup_ioctx(ctx_id);
>   if (likely(ioctx)) {
>   if (likely(min_nr <= nr && min_nr >= 0))
>   ret = read_events(ioctx, min_nr, nr, events, until);

Reviewed-by: Jeff Moyer 

The previous suggestion[1] of fixing the helpers never materialized, so
let's just get this fixed, already.

-Jeff

[1] https://marc.info/?l=linux-fsdevel&m=152209450618587&w=2


Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()

2019-06-14 Thread Jeff Moyer
"Aneesh Kumar K.V"  writes:

> On 6/14/19 10:06 PM, Dan Williams wrote:
>> On Fri, Jun 14, 2019 at 9:26 AM Aneesh Kumar K.V
>>  wrote:
>
>>> Why not let the arch
>>> decide the SUBSECTION_SHIFT and default to one subsection per
>>> section if arch is not enabled to work with subsection.
>>
>> Because that keeps the implementation from ever reaching a point where
>> a namespace might be able to be moved from one arch to another. If we
>> can squash these arch differences then we can have a common tool to
>> initialize namespaces outside of the kernel. The one wrinkle is
>> device-dax that wants to enforce the mapping size,
>
> The fsdax have a much bigger issue right? The file system block size
> is the same as PAGE_SIZE and we can't make it portable across archs
> that support different PAGE_SIZE?

File system blocks are not tied to page size.  They can't be *bigger*
than the page size currently, but they can be smaller.

Still, I don't see that as an argument against trying to make the
namespaces work across architectures.  Consider a user who only has
sector mode namespaces.  We'd like that to work if at all possible.

-Jeff


Re: [PATCH 7/7] libnvdimm/pfn: Fix 'start_pad' implementation

2019-02-22 Thread Jeff Moyer
Dan Williams  writes:

>> Great!  Now let's create another one.
>>
>> # ndctl create-namespace -m fsdax -s 132m
>> libndctl: ndctl_pfn_enable: pfn1.1: failed to enable
>>   Error: namespace1.2: failed to enable
>>
>> failed to create namespace: No such device or address
>>
>> (along with a kernel warning spew)
>
> I assume you're seeing this on the libnvdimm-pending branch?

Yes, but also on linus' master branch.  Things have been operating in
this manner for some time.

>> I understand the desire for expediency.  At some point, though, we have
>> to address the root of the problem.
>
> Well, you've defibrillated me back to reality. We've suffered the
> incomplete broken hacks for 2 years, what's another 10 weeks? I'll
> dust off the sub-section patches and take another run at it.

OK, thanks.  Let me know if I can help at all.

Cheers,
Jeff


Re: [PATCH 7/7] libnvdimm/pfn: Fix 'start_pad' implementation

2019-02-22 Thread Jeff Moyer
Dan Williams  writes:

>> > However, to fix this situation a non-backwards compatible change
>> > needs to be made to the interpretation of the nd_pfn info-block.
>> > ->start_pad needs to be accounted in ->map.map_offset (formerly
>> > ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be
>> > adjusted to the section aligned resource base used to establish
>> > ->map.map formerly (formerly ->virt_addr).
>> >
>> > The guiding principles of the info-block compatibility fixup is to
>> > maintain the interpretation of ->data_offset for implementations like
>> > the EFI driver that only care about data_access not dax, but cause older
>> > Linux implementations that care about the mode and dax to fail to parse
>> > the new info-block.
>>
>> What if the core mm grew support for hotplug on sub-section boundaries?
>> Would't that fix this problem (and others)?
>
> Yes, I think it would, and I had patches along these lines [2]. Last
> time I looked at this I was asked by core-mm folks to await some
> general refactoring of hotplug [3], and I wasn't proud about some of
> the hacks I used to make it work. In general I'm less confident about
> being able to get sub-section-hotplug over the goal line (core-mm
> resistance to hotplug complexity) vs the local hacks in nvdimm to deal
> with this breakage.

You first posted that patch series in December of 2016.  How long do we
wait for this refactoring to happen?

Meanwhile, we've been kicking this can down the road for far too long.
Simple namespace creation fails to work.  For example:

# ndctl create-namespace -m fsdax -s 128m
  Error: '--size=' must align to interleave-width: 6 and alignment: 2097152
  did you intend --size=132M?

failed to create namespace: Invalid argument

ok, I can't actually create a small, section-aligned namespace.  Let's
bump it up:

# ndctl create-namespace -m fsdax -s 132m
{
  "dev":"namespace1.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"126.00 MiB (132.12 MB)",
  "uuid":"2a5f8fe0-69e2-46bf-98bc-0f5667cd810a",
  "raw_uuid":"f7324317-5cd2-491e-8cd1-ad03770593f2",
  "sector_size":512,
  "blockdev":"pmem1",
  "numa_node":1
}

Great!  Now let's create another one.

# ndctl create-namespace -m fsdax -s 132m
libndctl: ndctl_pfn_enable: pfn1.1: failed to enable
  Error: namespace1.2: failed to enable

failed to create namespace: No such device or address

(along with a kernel warning spew)

And at this point, all further ndctl create-namespace commands fail.
Lovely.  This is a wart that was acceptable only because a fix was
coming.  2+ years later, and we're still adding hacks to work around it
(and there have been *several* hacks).

> Local hacks are always a sad choice, but I think leaving these
> configurations stranded for another kernel cycle is not tenable. It
> wasn't until the github issue did I realize that the problem was
> happening in the wild on NVDIMM-N platforms.

I understand the desire for expediency.  At some point, though, we have
to address the root of the problem.

-Jeff

>
> [2]: 
> https://lore.kernel.org/lkml/148964440651.19438.2288075389153762985.st...@dwillia2-desk3.amr.corp.intel.com/
> [3]: https://lore.kernel.org/lkml/20170319163531.ga25...@dhcp22.suse.cz/
>
>>
>> -Jeff
>>
>> [1] https://github.com/pmem/ndctl/issues/76


Re: [PATCH 7/7] libnvdimm/pfn: Fix 'start_pad' implementation

2019-02-21 Thread Jeff Moyer
Hi, Dan,

Thanks for the comprehensive write-up.  Comments below.

Dan Williams  writes:

> In the beginning the pmem driver simply passed the persistent memory
> resource range to memremap and was done. With the introduction of
> devm_memremap_pages() and vmem_altmap the implementation needed to
> contend with metadata at the start of the resource to indicate whether
> the vmemmap is located in System RAM or Persistent Memory, and reserve
> vmemmap capacity in pmem for the latter case.
>
> The indication of metadata space was communicated in the
> nd_pfn->data_offset property and it was defined to be identical to the
> pmem_device->data_offset property, i.e. relative to the raw resource
> base of the namespace. Up until this point in the driver's development
> pmem_device->phys_addr == __pa(pmem_device->virt_addr). This
> implementation was fine up until the discovery of platforms with
> physical address layouts that mapped Persistent Memory and System RAM to
> the same Linux memory hotplug section (128MB span).
>
> The nd_pfn->start_pad and nd_pfn->end_trunc properties were introduced
> to pad and truncate the capacity to fit within an exclusive Linux
> memory hotplug section span, and it was at this point where the
> ->start_pad definition did not comprehend the pmem_device->phys_addr to
> pmem_device->virt_addr relationship. Platforms in the wild typically
> only collided 'System RAM' at the end of the Persistent Memory range so
> ->start_pad was often zero.
>
> Lately Linux has encountered platforms that collide Persistent Memory
> regions between each other, specifically cases where ->start_pad needed
> to be non-zero. This lead to commit ae86cbfef381 "libnvdimm, pfn: Pad
> pfn namespaces relative to other regions". That commit allowed
> namespaces to be mapped with devm_memremap_pages(). However dax
> operations on those configurations currently fail if attempted within the
> ->start_pad range because pmem_device->data_offset was still relative to
> raw resource base not relative to the section aligned resource range
> mapped by devm_memremap_pages().
>
> Luckily __bdev_dax_supported() caught these failures and simply disabled
> dax.

Let me make sure I understand the current state of things.  Assume a
machine with two persistent memory ranges overlapping the same hotplug
memory section.  Let's take the example from the ndctl github issue[1]:

187c00-967bff : Persistent Memory

/sys/bus/nd/devices/region0/resource: 0x187c00
/sys/bus/nd/devices/region1/resource: 0x577c00

Create a namespace in region1.  That namespace will have a start_pad of
64MiB.  The problem is that, while the correct offset was specified when
laying out the struct pages (via arch_add_memory), the data_offset for
the pmem block device itself does not take the start_pad into account
(despite the comment in the nd_pfn_sb data structure!).  As a result,
the block device starts at the beginning of the address range, but
struct pages only exist for the address space starting 64MiB into the
range.  __bdev_dax_supported fails, because it tries to perform a
direct_access call on sector 0, and there's no pgmap for the address
corresponding to that sector.

So, we can't simply make the code correct (by adding the start_pad to
pmem->data_offset) without bumping the superblock version, because that
would change the size of the block device, and the location of data on
that block device would all be off by 64MiB (and you'd lose the first
64MiB).  Mass hysteria.

> However, to fix this situation a non-backwards compatible change
> needs to be made to the interpretation of the nd_pfn info-block.
> ->start_pad needs to be accounted in ->map.map_offset (formerly
> ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be
> adjusted to the section aligned resource base used to establish
> ->map.map formerly (formerly ->virt_addr).
>
> The guiding principles of the info-block compatibility fixup is to
> maintain the interpretation of ->data_offset for implementations like
> the EFI driver that only care about data_access not dax, but cause older
> Linux implementations that care about the mode and dax to fail to parse
> the new info-block.

What if the core mm grew support for hotplug on sub-section boundaries?
Would't that fix this problem (and others)?

-Jeff

[1] https://github.com/pmem/ndctl/issues/76


Re: [PATCH 0/7] libnvdimm/pfn: Fix section-alignment padding

2019-02-20 Thread Jeff Moyer
Dan Williams  writes:

> On Tue, Feb 12, 2019 at 1:37 PM Dan Williams  wrote:
>>
>> Lately Linux has encountered platforms that collide Persistent Memory
>> regions between each other, specifically cases where ->start_pad needed
>> to be non-zero. This lead to commit ae86cbfef381 "libnvdimm, pfn: Pad
>> pfn namespaces relative to other regions". That commit allowed
>> namespaces to be mapped with devm_memremap_pages(). However dax
>> operations on those configurations currently fail if attempted within the
>> ->start_pad range because pmem_device->data_offset was still relative to
>> raw resource base not relative to the section aligned resource range
>> mapped by devm_memremap_pages().
>>
>> Luckily __bdev_dax_supported() caught these failures and simply disabled
>> dax. However, to fix this situation a non-backwards compatible change
>> needs to be made to the interpretation of the nd_pfn info-block.
>> ->start_pad needs to be accounted in ->map.map_offset (formerly
>> ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be
>> adjusted to the section aligned resource base used to establish
>> ->map.map formerly (formerly ->virt_addr).
>>
>> See patch 7 "libnvdimm/pfn: Fix 'start_pad' implementation" for more
>> details, and the ndctl patch series "Improve support + testing for
>> labels + info-blocks" for the corresponding regression test.
>
> Hello valued reviewers, can I plead for a sanity check of at least
> "libnvdimm/pfn: Introduce super-block minimum version requirements"
> and "libnvdimm/pfn: Fix 'start_pad' implementation"? In particular
> Jeff / Johannes this has end user / distro impact in that users may
> lose access to namespaces that are upgraded to v1.3 info-blocks and
> then boot an old kernel. I did not see a way around that sharp edge.

Yes, I'll take a look.

Cheers,
Jeff


Re: [PATCH] acpi/nfit: Fix bus command validation

2019-02-19 Thread Jeff Moyer
Dan Williams  writes:

> Commit 11189c1089da "acpi/nfit: Fix command-supported detection" broke
> ND_CMD_CALL for bus-level commands. The "func = cmd" assumption is only
> valid for:
>
> ND_CMD_ARS_CAP
> ND_CMD_ARS_START
> ND_CMD_ARS_STATUS
> ND_CMD_CLEAR_ERROR
>
> The function number otherwise needs to be pulled from the command
> payload for:
>
> NFIT_CMD_TRANSLATE_SPA
> NFIT_CMD_ARS_INJECT_SET
> NFIT_CMD_ARS_INJECT_CLEAR
> NFIT_CMD_ARS_INJECT_GET
>
> Update cmd_to_func() for the bus case and call it in the common path.
>
> Fixes: 11189c1089da ("acpi/nfit: Fix command-supported detection")
> Cc: 
> Cc: Vishal Verma 
> Reported-by: Grzegorz Burzynski 
> Signed-off-by: Dan Williams 

Tricky code path, eh?

Tested-by: Jeff Moyer 

-Jeff

> ---
>  drivers/acpi/nfit/core.c |   22 --
>  1 file changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index e18ade5d74e9..c34c595d6bb0 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -415,7 +415,7 @@ static int cmd_to_func(struct nfit_mem *nfit_mem, 
> unsigned int cmd,
>   if (call_pkg) {
>   int i;
>  
> - if (nfit_mem->family != call_pkg->nd_family)
> + if (nfit_mem && nfit_mem->family != call_pkg->nd_family)
>   return -ENOTTY;
>  
>   for (i = 0; i < ARRAY_SIZE(call_pkg->nd_reserved2); i++)
> @@ -424,6 +424,10 @@ static int cmd_to_func(struct nfit_mem *nfit_mem, 
> unsigned int cmd,
>   return call_pkg->nd_command;
>   }
>  
> + /* In the !call_pkg case, bus commands == bus functions */
> + if (!nfit_mem)
> + return cmd;
> +
>   /* Linux ND commands == NVDIMM_FAMILY_INTEL function numbers */
>   if (nfit_mem->family == NVDIMM_FAMILY_INTEL)
>   return cmd;
> @@ -454,17 +458,18 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor 
> *nd_desc, struct nvdimm *nvdimm,
>   if (cmd_rc)
>   *cmd_rc = -EINVAL;
>  
> + if (cmd == ND_CMD_CALL)
> + call_pkg = buf;
> + func = cmd_to_func(nfit_mem, cmd, call_pkg);
> + if (func < 0)
> + return func;
> +
>   if (nvdimm) {
>   struct acpi_device *adev = nfit_mem->adev;
>  
>   if (!adev)
>   return -ENOTTY;
>  
> - if (cmd == ND_CMD_CALL)
> - call_pkg = buf;
> - func = cmd_to_func(nfit_mem, cmd, call_pkg);
> - if (func < 0)
> - return func;
>   dimm_name = nvdimm_name(nvdimm);
>   cmd_name = nvdimm_cmd_name(cmd);
>   cmd_mask = nvdimm_cmd_mask(nvdimm);
> @@ -475,12 +480,9 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, 
> struct nvdimm *nvdimm,
>   } else {
>   struct acpi_device *adev = to_acpi_dev(acpi_desc);
>  
> - func = cmd;
>   cmd_name = nvdimm_bus_cmd_name(cmd);
>   cmd_mask = nd_desc->cmd_mask;
> - dsm_mask = cmd_mask;
> - if (cmd == ND_CMD_CALL)
> - dsm_mask = nd_desc->bus_dsm_mask;
> + dsm_mask = nd_desc->bus_dsm_mask;
>   desc = nd_cmd_bus_desc(cmd);
>   guid = to_nfit_uuid(NFIT_DEV_BUS);
>   handle = adev->handle;
>
> ___
> Linux-nvdimm mailing list
> linux-nvd...@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RFC PATCH] aio: add check for timeout to aviod invalid value

2019-02-18 Thread Jeff Moyer
Tan Xiaojun  writes:

> (When I was testing with syzkaller, I found a lot of ubsan problems. Here 
> is one of them. I am not sure if it needs to be fixed and how it will be 
> fixed. So I sent this patch to ask your opinion.)
>
> Syzkaller reported a UBSAN bug below, which was mainly caused by a large
> negative number passed to the timeout of the io_getevents system call.
>
> 
> UBSAN: Undefined behaviour in ./include/linux/ktime.h:42:14
> signed integer overflow:
> -8427032702788048137 * 10 cannot be represented in type 'long long 
> int'
> CPU: 3 PID: 11668 Comm: syz-executor0 Not tainted 4.19.18-514.55.6.9.x86_64+ 
> #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 
> 04/01/2014
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0xca/0x13e lib/dump_stack.c:113
>  ubsan_epilogue+0xe/0x81 lib/ubsan.c:159
>  handle_overflow+0x193/0x1e2 lib/ubsan.c:190
>  ktime_set include/linux/ktime.h:42 [inline]
>  timespec64_to_ktime include/linux/ktime.h:78 [inline]
>  do_io_getevents+0x307/0x390 fs/aio.c:2043
>  __do_sys_io_getevents fs/aio.c:2080 [inline]
>  __se_sys_io_getevents fs/aio.c:2068 [inline]
>  __x64_sys_io_getevents+0x163/0x250 fs/aio.c:2068
>  do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x462589
> Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7fde9b04ec58 EFLAGS: 0246 ORIG_RAX: 00d0
> RAX: ffda RBX: 0072bf00 RCX: 00462589
> RDX: 0006 RSI:  RDI: 
> RBP: 0005 R08: 2100 R09: 
> R10: 2040 R11: 0246 R12: 7fde9b04f6bc
> R13: 004bd1f0 R14: 006f6b60 R15: 
> 
> bond0 (unregistering): Released all slaves
>
> The timeout described in "man io_getevents" does not say whether it
> can be negative or not, but as a waiting time, a negative number has
> no meaning. So I add check to avoid this case.

It's embarrassing that this bug is still present.  See, for example,
this discussion, started in 2015:
  
https://lore.kernel.org/lkml/cact4y+bbxvylq6ltokrktnlthqlhcw-bmp3aqp3mjdavr9f...@mail.gmail.com/

I could swear it was brought up again since then, but I can't find
records of that.

> Signed-off-by: Tan Xiaojun 
> ---
>  fs/aio.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index f4d..28e0fa6 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -2078,10 +2078,15 @@ static long do_io_getevents(aio_context_t ctx_id,
>   struct io_event __user *events,
>   struct timespec64 *ts)
>  {
> - ktime_t until = ts ? timespec64_to_ktime(*ts) : KTIME_MAX;
> + ktime_t until;
>   struct kioctx *ioctx = lookup_ioctx(ctx_id);
>   long ret = -EINVAL;
>  
> + if (ts && !timespec64_valid(ts))
> + return -EINVAL;
> +
> + until = ts ? timespec64_to_ktime(*ts) : KTIME_MAX;
> +
>   if (likely(ioctx)) {
>   if (likely(min_nr <= nr && min_nr >= 0))
>   ret = read_events(ioctx, min_nr, nr, events, until);

Looks good to me.  Thanks for fixing this.

Reviewed-by: Jeff Moyer 


Re: [PATCH] blk-mq: fix the cmd_flag_name array

2019-01-24 Thread Jeff Moyer
Jianchao Wang  writes:

> Swap REQ_NOWAIT and REQ_NOUNMAP and add REQ_HIPRI.
>
> Signed-off-by: Jianchao Wang 
> ---
>  block/blk-mq-debugfs.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 90d6876..f812083 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -308,8 +308,9 @@ static const char *const cmd_flag_name[] = {
>   CMD_FLAG_NAME(PREFLUSH),
>   CMD_FLAG_NAME(RAHEAD),
>   CMD_FLAG_NAME(BACKGROUND),
> - CMD_FLAG_NAME(NOUNMAP),
>   CMD_FLAG_NAME(NOWAIT),
> + CMD_FLAG_NAME(NOUNMAP),
> + CMD_FLAG_NAME(HIPRI),
>  };
>  #undef CMD_FLAG_NAME

Acked-by: Jeff Moyer 

You might consider also adding a comment above the req_flag_bits enum
noting that modifications also need to be propagated to cmd_flag_name.


Re: [PATCH 0/4] Allow persistent memory to be used like normal RAM

2019-01-17 Thread Jeff Moyer
Keith Busch  writes:

>> Keith, you seem to be implying that there are platforms that won't
>> support memory mode.  Do you also have some insight into how customers
>> want to use this, beyond my speculation?  It's really frustrating to see
>> patch sets like this go by without any real use cases provided.
>
> Right, most NFIT reporting platforms today don't have memory mode, and
> the kernel currently only supports the persistent DAX mode with these.
> This series adds another option for those platforms.

All NFIT reporting platforms today are shipping NVDIMM-Ns, where it
makes absolutely no sense to use them as regular DRAM.  I don't think
that's a good argument to make.

> I think numactl as you mentioned is the first consideration for how
> customers may make use. Dave or Dan might have other use cases in mind.

Well, it sure looks like this took a lot of work, so I thought there
were known use cases or users asking for this functionality.

Cheers,
Jeff


Re: [PATCH 0/4] Allow persistent memory to be used like normal RAM

2019-01-17 Thread Jeff Moyer
Keith Busch  writes:

> On Thu, Jan 17, 2019 at 11:29:10AM -0500, Jeff Moyer wrote:
>> Dave Hansen  writes:
>> > Persistent memory is cool.  But, currently, you have to rewrite
>> > your applications to use it.  Wouldn't it be cool if you could
>> > just have it show up in your system like normal RAM and get to
>> > it like a slow blob of memory?  Well... have I got the patch
>> > series for you!
>> 
>> So, isn't that what memory mode is for?
>>   
>> https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
>> 
>> Why do we need this code in the kernel?
>
> I don't think those are the same thing. The "memory mode" in the link
> refers to platforms that sequester DRAM to side cache memory access, where
> this series doesn't have that platform dependency nor hides faster DRAM.

OK, so you are making two arguments, here.  1) platforms may not support
memory mode, and 2) this series allows for performance-differentiated
memory (even though applications may not be modified to make use of
that...).

With this patch set, an unmodified application would either use:

1) whatever memory it happened to get
2) only the faster dram (via numactl --membind=)
3) only the slower pmem (again, via numactl --membind1)
4) preferentially one or the other (numactl --preferred=)

The other options are:
- as mentioned above, memory mode, which uses DRAM as a cache for the
  slower persistent memory.  Note that it isn't all or nothing--you can
  configure your system with both memory mode and appdirect.  The
  limitation, of course, is that your platform has to support this.

  This seems like the obvious solution if you want to make use of the
  larger pmem capacity as regular volatile memory (and your platform
  supports it).  But maybe there is some other limitation that motivated
  this work?

- libmemkind or pmdk.  These options typically* require application
  modifications, but allow those applications to actively decide which
  data lives in fast versus slow media.

  This seems like the obvious answer for applications that care about
  access latency.

* you could override the system malloc, but some libraries/application
  stacks already do that, so it isn't a universal solution.

Listing something like this in the headers of these patch series would
considerably reduce the head-scratching for reviewers.

Keith, you seem to be implying that there are platforms that won't
support memory mode.  Do you also have some insight into how customers
want to use this, beyond my speculation?  It's really frustrating to see
patch sets like this go by without any real use cases provided.

Cheers,
Jeff


Re: [PATCH 0/4] Allow persistent memory to be used like normal RAM

2019-01-17 Thread Jeff Moyer
Dave Hansen  writes:

> Persistent memory is cool.  But, currently, you have to rewrite
> your applications to use it.  Wouldn't it be cool if you could
> just have it show up in your system like normal RAM and get to
> it like a slow blob of memory?  Well... have I got the patch
> series for you!

So, isn't that what memory mode is for?
  
https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/

Why do we need this code in the kernel?

-Jeff


Re: [PATCH v3 0/2] acpi/nfit: Fix command-supported detection

2019-01-15 Thread Jeff Moyer
Dan Williams  writes:

> Changes since v2 [1]:
> * Don't allow ND_CMD_CALL to bypass dsm_mask restrictions (Jeff)
>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2019-January/019498.html
>
> ---
>
> One last resend to make sure all the last bits of thrash have settled.

LGTM.

Thanks!
Jeff


Re: [PATCH v2 0/2] acpi/nfit: Fix command-supported detection

2019-01-15 Thread Jeff Moyer
Dan Williams  writes:

> On Tue, Jan 15, 2019 at 6:16 AM Jeff Moyer  wrote:
>>
>> Dan Williams  writes:
>>
>> > Changes since v1 [1]:
>> > * Include another patch make sure that function-number zero can be
>> >   safely used as an invalid function number (Jeff)
>> > * Add a comment clarifying why zero is an invalid function number (Jeff)
>> > * Pass nfit_mem to cmd_to_func() (Jeff)
>> > * Collect a Tested-by from Sujith
>> > [1]: https://lists.01.org/pipermail/linux-nvdimm/2019-January/019435.html
>>
>> For the series:
>>
>> Reviewed-by: Jeff Moyer 
>>
>> Thanks, Dan!
>
> Thanks, although I just realized one more change. The ND_CMD_CALL case
> should zero out command after the function translation, otherwise
> userspace can call functions that the kernel is blocking in the
> dsm_mask.
>
> Holler if this invalidates your "Reviewed-by".

AAAHH

:)

> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index 87e02f281e51..d7747aceb7ab 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -463,6 +463,12 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor
> *nd_desc, struct nvdimm *nvdimm,
> func = cmd_to_func(nfit_mem, cmd, buf);
> if (func < 0)
> return func;
> +   /*
> +* In the ND_CMD_CALL case we're now dependent on 'func'
> +* being validated by the dimm's dsm_mask
> +*/
> +   if (cmd == ND_CMD_CALL)
> +   cmd = 0;
> dimm_name = nvdimm_name(nvdimm);
> cmd_name = nvdimm_cmd_name(cmd);
> cmd_mask = nvdimm_cmd_mask(nvdimm);
> dsm_mask = nfit_mem->dsm_mask;
> desc = nd_cmd_dimm_desc(cmd);

That sure doesn't look right.  Now cmd_name and desc will be wrong.

> @@ -477,8 +483,10 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor
> *nd_desc, struct nvdimm *nvdimm,
> cmd_name = nvdimm_bus_cmd_name(cmd);
> cmd_mask = nd_desc->cmd_mask;
> dsm_mask = cmd_mask;
> -   if (cmd == ND_CMD_CALL)
> +   if (cmd == ND_CMD_CALL) {
> dsm_mask = nd_desc->bus_dsm_mask;
> +   cmd = 0;
> +   }
> desc = nd_cmd_bus_desc(cmd);

And again here.

We could reorder the zeroing, or you could modify the check for a valid
command/function.  Something like this?

/*
 * Check for a valid command.  For ND_CMD_CALL, we also
 * have to make sure that the DSM function is supported.
 */
if (cmd == ND_CMD_CALL && !test_bit(func, &dsm_mask))
return -ENOTTY;
else if (!test_bit(cmd, &cmd_mask))
return -ENOTTY;

Which way do you think is cleaner?

Cheers,
Jeff


Re: [PATCH v2 0/2] acpi/nfit: Fix command-supported detection

2019-01-15 Thread Jeff Moyer
Dan Williams  writes:

> Changes since v1 [1]:
> * Include another patch make sure that function-number zero can be
>   safely used as an invalid function number (Jeff)
> * Add a comment clarifying why zero is an invalid function number (Jeff)
> * Pass nfit_mem to cmd_to_func() (Jeff)
> * Collect a Tested-by from Sujith
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2019-January/019435.html

For the series:

Reviewed-by: Jeff Moyer 

Thanks, Dan!


>
> ---
>
> Quote patch2 changelog:
>
> The _DSM function number validation only happens to succeed when the
> generic Linux command number translation corresponds with a
> DSM-family-specific function number. This breaks NVDIMM-N
> implementations that correctly implement _LSR, _LSW, and _LSI, but do
> not happen to publish support for DSM function numbers 4, 5, and 6.
>
> Recall that the support for _LS{I,R,W} family of methods results in the
> DIMM being marked as supporting those command numbers at
> acpi_nfit_register_dimms() time. The DSM function mask is only used for
> ND_CMD_CALL support of non-NVDIMM_FAMILY_INTEL devices.
>
> ---
>
> Dan Williams (2):
>   acpi/nfit: Block function zero DSMs
>   acpi/nfit: Fix command-supported detection
>
>
>  drivers/acpi/nfit/core.c |   53 ++
>  1 file changed, 39 insertions(+), 14 deletions(-)


Re: [PATCH] acpi/nfit: Fix command-supported detection

2019-01-14 Thread Jeff Moyer
Dan Williams  writes:

> On Mon, Jan 14, 2019 at 8:43 AM Dan Williams  wrote:
>> On Mon, Jan 14, 2019 at 7:19 AM Jeff Moyer  wrote:
> [..]
>> > > +
>> > > + if (cmd == ND_CMD_CALL) {
>> > > + int i;
>> > > +
>> > > + if (call_pkg && nfit_mem->family != call_pkg->nd_family)
>> > > + return -ENOTTY;
>> > > +
>> > > + for (i = 0; i < ARRAY_SIZE(call_pkg->nd_reserved2); i++)
>> > > + if (call_pkg->nd_reserved2[i])
>> > > + return -EINVAL;
>> > > + return call_pkg->nd_command;
>> > > + }
>> > > +
>> > > + /* Linux ND commands == NVDIMM_FAMILY_INTEL function numbers */
>> > > + if (nfit_mem->family == NVDIMM_FAMILY_INTEL)
>> > > + return cmd;
>> > > + return 0;
>> >
>> > Function zero?  Is that really the right thing to return here?
>>
>> Yes, function zero is never set in n
>
> ...whoops fumble fingered "send"
>
> Function zero should never be set in nfit_mem->dsm_mask, although the
> NVDIMM_FAMILY_MSFT mask violates this assumption. I'll fix that up.

OK, I think I see how it all fits together now, thanks.  It would be
nice if you documented this magical 0 return somehow.

Cheers,
Jeff


Re: [PATCH] acpi/nfit: Fix command-supported detection

2019-01-14 Thread Jeff Moyer
Dan Williams  writes:

> The _DSM function number validation only happens to succeed when the
> generic Linux command number translation corresponds with a
> DSM-family-specific function number. This breaks NVDIMM-N
> implementations that correctly implement _LSR, _LSW, and _LSI, but do
> not happen to publish support for DSM function numbers 4, 5, and 6.
>
> Recall that the support for _LS{I,R,W} family of methods results in the
> DIMM being marked as supporting those command numbers at
> acpi_nfit_register_dimms() time. The DSM function mask is only used for
> ND_CMD_CALL support of non-NVDIMM_FAMILY_INTEL devices.
>
> Fixes: 31eca76ba2fc ("nfit, libnvdimm: limited/whitelisted dimm command...")
> Cc: 
> Link: https://github.com/pmem/ndctl/issues/78
> Reported-by: Sujith Pandel 
> Signed-off-by: Dan Williams 
> ---
> Sujith, this is a larger change than what you originally tested, but it
> should behave the same. I wanted to consolidate all the code that
> handles Linux command number to DIMM _DSM function number translation.
>
> If you have a chance to re-test with this it would be much appreciated.
>
> Thanks for the report!
>
>  drivers/acpi/nfit/core.c |   43 +--
>  1 file changed, 29 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index 790691d9a982..d5d64e90ae71 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -409,6 +409,29 @@ static bool payload_dumpable(struct nvdimm *nvdimm, 
> unsigned int func)
>   return true;
>  }
>  
> +static int cmd_to_func(struct nvdimm *nvdimm, unsigned int cmd,
> + struct nd_cmd_pkg *call_pkg)
> +{
> + struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);

Minor nit: Seems like the function could take an nfit_mem parameter instead of 
an nvdimm.

> +
> + if (cmd == ND_CMD_CALL) {
> + int i;
> +
> + if (call_pkg && nfit_mem->family != call_pkg->nd_family)
> + return -ENOTTY;
> +
> + for (i = 0; i < ARRAY_SIZE(call_pkg->nd_reserved2); i++)
> + if (call_pkg->nd_reserved2[i])
> + return -EINVAL;
> + return call_pkg->nd_command;
> + }
> +
> + /* Linux ND commands == NVDIMM_FAMILY_INTEL function numbers */
> + if (nfit_mem->family == NVDIMM_FAMILY_INTEL)
> + return cmd;
> + return 0;

Function zero?  Is that really the right thing to return here?

Cheers,
Jeff


Re: [PATCH] scsi: isci: initialize shost fully before calling scsi_add_host()

2019-01-08 Thread Jeff Moyer
Logan Gunthorpe  writes:

> scsi_mq_setup_tags(), which is called by scsi_add_host(), calculates
> the command size to allocate based on the prot_capabilities. In the
> isci driver, scsi_host_set_prot() is called after scsi_add_host()
> so the command size gets calculated to be smaller than it needs to be.
> Eventually, scsi_mq_init_request() locates the 'prot_sdb' after the command
> assuming it was sized correctly and a buffer overrun may occur.

[...]

> To prevent this, the calls to scsi_host_set_prot() are moved into
> isci_host_alloc() before the call to scsi_add_host(). Out of caution,
> also move the similar call to scsi_host_set_guard().
>
> Fixes: 3d2d75254915 ("[SCSI] isci: T10 DIF support")
> Link: http://lkml.kernel.org/r/da851333-eadd-163a-8c78-e1f4ec5ec...@deltatee.com
> Signed-off-by: Logan Gunthorpe 
> Cc: Intel SCU Linux support 
> Cc: Artur Paszkiewicz 
> Cc: "James E.J. Bottomley" 
> Cc: "Martin K. Petersen" 
> Cc: Christoph Hellwig 
> Cc: Jens Axboe 
> Cc: Jeff Moyer 

Nice job, and excellent commit message.  We'll need a similar patch for
lpfc.

Reviewed-by: Jeff Moyer 

> ---
>  drivers/scsi/isci/init.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/scsi/isci/init.c b/drivers/scsi/isci/init.c
> index 68b90c4f79a3..1727d0c71b12 100644
> --- a/drivers/scsi/isci/init.c
> +++ b/drivers/scsi/isci/init.c
> @@ -576,6 +576,13 @@ static struct isci_host *isci_host_alloc(struct pci_dev 
> *pdev, int id)
>   shost->max_lun = ~0;
>   shost->max_cmd_len = MAX_COMMAND_SIZE;
>  
> + /* turn on DIF support */
> + scsi_host_set_prot(shost,
> +SHOST_DIF_TYPE1_PROTECTION |
> +SHOST_DIF_TYPE2_PROTECTION |
> +SHOST_DIF_TYPE3_PROTECTION);
> + scsi_host_set_guard(shost, SHOST_DIX_GUARD_CRC);
> +
>   err = scsi_add_host(shost, &pdev->dev);
>   if (err)
>   goto err_shost;
> @@ -663,13 +670,6 @@ static int isci_pci_probe(struct pci_dev *pdev, const 
> struct pci_device_id *id)
>   goto err_host_alloc;
>   }
>   pci_info->hosts[i] = h;
> -
> - /* turn on DIF support */
> - scsi_host_set_prot(to_shost(h),
> -SHOST_DIF_TYPE1_PROTECTION |
> -SHOST_DIF_TYPE2_PROTECTION |
> -SHOST_DIF_TYPE3_PROTECTION);
> - scsi_host_set_guard(to_shost(h), SHOST_DIX_GUARD_CRC);
>   }
>  
>   err = isci_setup_interrupts(pdev);


Re: Regression in v5.0-rc1: Panic at boot

2019-01-08 Thread Jeff Moyer
Hi, Logan,

Logan Gunthorpe  writes:

> Hey,
>
> I found a regression in v5.0-rc1 this morning. My system panics on boot.
> I've attached a log of the panic.
>
> I bisected to find the problematic commit is:
>
> Fixes: 9d037ad707ed ("block: remove req->timeout_list")
>
> But it makes no sense to me why this commit would cause a problem like
> this. I've attached a bisect log. I've also tested v5.0-rc1 with this
> commit reverted and that boots fine.
>
> The traceback seems to indicate the problem is on the bip_get_seed()
> line in t10_pi_complete(). Which suggests that bio_integrity() is
> returning NULL.

Does your hardware actually support protection information?

-Jeff


Re: [PATCH] aio: Convert ioctx_table to XArray

2018-12-11 Thread Jeff Moyer
Jens Axboe  writes:

> On 12/11/18 11:02 AM, Jeff Moyer wrote:
>> Matthew Wilcox  writes:
>> 
>>> On Tue, Dec 11, 2018 at 12:21:52PM -0500, Jeff Moyer wrote:
>>>> I'm going to submit this version formally.  If you're interested in
>>>> converting the ioctx_table to xarray, you can do that separately from a
>>>> security fix.  I would include a performance analysis with that patch,
>>>> though.  The idea of using a radix tree for the ioctx table was
>>>> discarded due to performance reasons--see commit db446a08c23d5 ("aio:
>>>> convert the ioctx list to table lookup v3").  I suspect using the xarray
>>>> will perform similarly.
>>>
>>> There's a big difference between Octavian's patch and mine.  That patch
>>> indexed into the radix tree by 'ctx_id' directly, which was pretty
>>> much guaranteed to exhibit some close-to-worst-case behaviour from the
>>> radix tree due to IDs being sparsely assigned.  My patch uses the ring
>>> ID which _we_ assigned, and so is nicely behaved, being usually a very
>>> small integer.
>> 
>> OK, good to know.  I obviously didn't look too closely at the two.
>> 
>>> What performance analysis would you find compelling?  Octavian's original
>>> fio script:
>>>
>>>> rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
>>>> blocksize=1024; numjobs=512; thread; loops=100
>>>>
>>>> on an EXT2 filesystem mounted on top of a ramdisk
>>>
>>> or something else?
>> 
>> I think the most common use case is a small number of ioctx-s, so I'd
>> like to see that use case not regress (that should be easy, right?).

Bah, I meant a small number of threads doing submit/getevents.

>> Kent, what were the tests you were using when doing this work?  Jens,
>> since you're doing performance work in this area now, are there any
>> particular test cases you care about?
>
> I can give it a spin, ioctx lookup is in the fast path, and for "classic"
> aio we do it twice for each IO...

Thanks!

-Jeff


Re: [PATCH] aio: Convert ioctx_table to XArray

2018-12-11 Thread Jeff Moyer
Matthew Wilcox  writes:

> On Tue, Dec 11, 2018 at 12:21:52PM -0500, Jeff Moyer wrote:
>> I'm going to submit this version formally.  If you're interested in
>> converting the ioctx_table to xarray, you can do that separately from a
>> security fix.  I would include a performance analysis with that patch,
>> though.  The idea of using a radix tree for the ioctx table was
>> discarded due to performance reasons--see commit db446a08c23d5 ("aio:
>> convert the ioctx list to table lookup v3").  I suspect using the xarray
>> will perform similarly.
>
> There's a big difference between Octavian's patch and mine.  That patch
> indexed into the radix tree by 'ctx_id' directly, which was pretty
> much guaranteed to exhibit some close-to-worst-case behaviour from the
> radix tree due to IDs being sparsely assigned.  My patch uses the ring
> ID which _we_ assigned, and so is nicely behaved, being usually a very
> small integer.

OK, good to know.  I obviously didn't look too closely at the two.

> What performance analysis would you find compelling?  Octavian's original
> fio script:
>
>> rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
>> blocksize=1024; numjobs=512; thread; loops=100
>> 
>> on an EXT2 filesystem mounted on top of a ramdisk
>
> or something else?

I think the most common use case is a small number of ioctx-s, so I'd
like to see that use case not regress (that should be easy, right?).
Kent, what were the tests you were using when doing this work?  Jens,
since you're doing performance work in this area now, are there any
particular test cases you care about?

Cheers,
Jeff


Re: [PATCH] aio: Convert ioctx_table to XArray

2018-12-11 Thread Jeff Moyer
Jeff Moyer  writes:

> Jeff Moyer  writes:
>
>> Matthew Wilcox  writes:
>>
>>> This custom resizing array was vulnerable to a Spectre attack (speculating
>>> off the end of an array to a user-controlled offset).  The XArray is
>>> not vulnerable to Spectre as it always masks its lookups to be within
>>> the bounds of the array.
>>
>> I'm not a big fan of completely re-writing the code to fix this.  Isn't
>> the below patch sufficient?
>
> Too quick on the draw.  Here's a patch that compiles.  ;-)

Hi, Matthew,

I'm going to submit this version formally.  If you're interested in
converting the ioctx_table to xarray, you can do that separately from a
security fix.  I would include a performance analysis with that patch,
though.  The idea of using a radix tree for the ioctx table was
discarded due to performance reasons--see commit db446a08c23d5 ("aio:
convert the ioctx list to table lookup v3").  I suspect using the xarray
will perform similarly.

Cheers,
Jeff

> diff --git a/fs/aio.c b/fs/aio.c
> index 97f983592925..aac9659381d2 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -45,6 +45,7 @@
>  
>  #include 
>  #include 
> +#include <linux/nospec.h>
>  
>  #include "internal.h"
>  
> @@ -1038,6 +1039,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
>   if (!table || id >= table->nr)
>   goto out;
>  
> + id = array_index_nospec(id, table->nr);
>   ctx = rcu_dereference(table->table[id]);
>   if (ctx && ctx->user_id == ctx_id) {
>   if (percpu_ref_tryget_live(&ctx->users))
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majord...@kvack.org.  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: a...@kvack.org


Re: [PATCH] aio: Convert ioctx_table to XArray

2018-12-06 Thread Jeff Moyer
Jeff Moyer  writes:

> Matthew Wilcox  writes:
>
>> This custom resizing array was vulnerable to a Spectre attack (speculating
>> off the end of an array to a user-controlled offset).  The XArray is
>> not vulnerable to Spectre as it always masks its lookups to be within
>> the bounds of the array.
>
> I'm not a big fan of completely re-writing the code to fix this.  Isn't
> the below patch sufficient?

Too quick on the draw.  Here's a patch that compiles.  ;-)

Cheers,
Jeff

diff --git a/fs/aio.c b/fs/aio.c
index 97f983592925..aac9659381d2 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -45,6 +45,7 @@
 
 #include 
 #include 
+#include <linux/nospec.h>
 
 #include "internal.h"
 
@@ -1038,6 +1039,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
if (!table || id >= table->nr)
goto out;
 
+   id = array_index_nospec(id, table->nr);
ctx = rcu_dereference(table->table[id]);
if (ctx && ctx->user_id == ctx_id) {
if (percpu_ref_tryget_live(&ctx->users))


Re: [PATCH] aio: Convert ioctx_table to XArray

2018-12-06 Thread Jeff Moyer
Matthew Wilcox  writes:

> This custom resizing array was vulnerable to a Spectre attack (speculating
> off the end of an array to a user-controlled offset).  The XArray is
> not vulnerable to Spectre as it always masks its lookups to be within
> the bounds of the array.

I'm not a big fan of completely re-writing the code to fix this.  Isn't
the below patch sufficient?

-Jeff

diff --git a/fs/aio.c b/fs/aio.c
index 97f983592925..9402ae0b63d5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1038,6 +1038,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
if (!table || id >= table->nr)
goto out;
 
+   id = array_index_nospec(index, table->nr);
ctx = rcu_dereference(table->table[id]);
if (ctx && ctx->user_id == ctx_id) {
if (percpu_ref_tryget_live(&ctx->users))


Re: Very slow SYS_io_destroy()

2018-07-23 Thread Jeff Moyer
Alex Richman  writes:

> Hi,
>
> I'm seeing some weirdness with AIO, specifically SYS_io_destroy() is
> taking upwards of ~10^5 microseconds (~100 milliseconds) per call.
> How long is that call expected to take?  I can see from the source

Well, it waits for an RCU grace period.  I would not expect that to take
100ms, though.  What kernel version is this?

-Jeff


Re: [PATCH v6 4/5] fs: blkdev set bio prio from kiocb prio

2018-05-22 Thread Jeff Moyer
adam.manzana...@wdc.com writes:

> From: Adam Manzanares <adam.manzana...@wdc.com>
>
> Now that kiocb has an ioprio field copy this over to the bio when it is
> created from the kiocb.
>
> Signed-off-by: Adam Manzanares <adam.manzana...@wdc.com>

Reviewed-by: Jeff Moyer <jmo...@redhat.com>

Thanks!
Jeff

> ---
>  fs/block_dev.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 7ec920e27065..11ba99e79d2a 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -216,6 +216,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
> iov_iter *iter,
>   bio.bi_write_hint = iocb->ki_hint;
>   bio.bi_private = current;
>   bio.bi_end_io = blkdev_bio_end_io_simple;
> + bio.bi_ioprio = iocb->ki_ioprio;
>  
>   ret = bio_iov_iter_get_pages(&bio, iter);
>   if (unlikely(ret))
> @@ -355,6 +356,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, int nr_pages)
>   bio->bi_write_hint = iocb->ki_hint;
>   bio->bi_private = dio;
>   bio->bi_end_io = blkdev_bio_end_io;
> + bio->bi_ioprio = iocb->ki_ioprio;
>  
>   ret = bio_iov_iter_get_pages(bio, iter);
>   if (unlikely(ret)) {



Re: [PATCH v5 4/5] fs: blkdev set bio prio from kiocb prio

2018-05-21 Thread Jeff Moyer
adam.manzana...@wdc.com writes:

> From: Adam Manzanares 
>
> Now that kiocb has an ioprio field copy this over to the bio when it is
> created from the kiocb.
>
> Signed-off-by: Adam Manzanares 
> ---
>  fs/block_dev.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 7ec920e27065..da1e94d2bb75 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -355,6 +355,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, int nr_pages)
>   bio->bi_write_hint = iocb->ki_hint;
>   bio->bi_private = dio;
>   bio->bi_end_io = blkdev_bio_end_io;
> + bio->bi_ioprio = iocb->ki_ioprio;
>  
>   ret = bio_iov_iter_get_pages(bio, iter);
>   if (unlikely(ret)) {

Forgot to mention, you should also convert __blkdev_direct_IO_simple.

-Jeff



Re: [PATCH v5 3/5] fs: Add aio iopriority support

2018-05-21 Thread Jeff Moyer
adam.manzana...@wdc.com writes:

> From: Adam Manzanares <adam.manzana...@wdc.com>
>
> This is the per-I/O equivalent of the ioprio_set system call.
>
> When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the
> newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field.
>
> We set the blkdev bio iopriority unconditionally, so we need to guarantee the 
> kiocb is initialized properly. Added changes to the loopback driver and 
> init_sync_kiocb to achieve this.
>
> This patch depends on block: add ioprio_check_cap function.
>
> Signed-off-by: Adam Manzanares <adam.manzana...@wdc.com>

Reviewed-by: Jeff Moyer <jmo...@redhat.com>


> ---
>  drivers/block/loop.c |  3 +++
>  fs/aio.c | 16 
>  include/linux/fs.h   |  3 +++
>  include/uapi/linux/aio_abi.h |  1 +
>  4 files changed, 23 insertions(+)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 5d4e31655d96..dd98dfd97f5e 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -76,6 +76,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +
>  #include "loop.h"
>  
>  #include 
> @@ -559,6 +561,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>   cmd->iocb.ki_filp = file;
>   cmd->iocb.ki_complete = lo_rw_aio_complete;
>   cmd->iocb.ki_flags = IOCB_DIRECT;
> + cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
>   if (cmd->css)
>   kthread_associate_blkcg(cmd->css);
>  
> diff --git a/fs/aio.c b/fs/aio.c
> index f3eae5d5771b..44b4572be524 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1451,6 +1451,22 @@ static int aio_prep_rw(struct kiocb *req, struct iocb 
> *iocb)
>   if (iocb->aio_flags & IOCB_FLAG_RESFD)
>   req->ki_flags |= IOCB_EVENTFD;
>   req->ki_hint = file_write_hint(req->ki_filp);
> + if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
> + /*
> +  * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
> +  * aio_reqprio is interpreted as an I/O scheduling
> +  * class and priority.
> +  */
> + ret = ioprio_check_cap(iocb->aio_reqprio);
> + if (ret) {
> + pr_debug("aio ioprio check cap error\n");
> + return -EINVAL;
> + }
> +
> + req->ki_ioprio = iocb->aio_reqprio;
> + } else
> + req->ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
> +
>   ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
>   if (unlikely(ret))
>   fput(req->ki_filp);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 50de40dbbb85..73b749ed3ea1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -36,6 +36,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -302,6 +303,7 @@ struct kiocb {
>   void*private;
>   int ki_flags;
>   u16 ki_hint;
> + u16 ki_ioprio; /* See linux/ioprio.h */
>  } __randomize_layout;
>  
>  static inline bool is_sync_kiocb(struct kiocb *kiocb)
> @@ -1942,6 +1944,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, 
> struct file *filp)
>   .ki_filp = filp,
>   .ki_flags = iocb_flags(filp),
>   .ki_hint = ki_hint_validate(file_write_hint(filp)),
> + .ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0),
>   };
>  }
>  
> diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
> index 2c0a3415beee..d4e768d55d14 100644
> --- a/include/uapi/linux/aio_abi.h
> +++ b/include/uapi/linux/aio_abi.h
> @@ -55,6 +55,7 @@ enum {
>   *   is valid.
>   */
>  #define IOCB_FLAG_RESFD  (1 << 0)
> +#define IOCB_FLAG_IOPRIO (1 << 1)
>  
>  /* read() from /dev/aio returns these structures. */
>  struct io_event {



Re: [PATCH v5 5/5] fs: iomap dio set bio prio from kiocb prio

2018-05-21 Thread Jeff Moyer
adam.manzana...@wdc.com writes:

> From: Adam Manzanares <adam.manzana...@wdc.com>
>
> Now that kiocb has an ioprio field copy this over to the bio when it is
> created from the kiocb during direct IO.
>
> Signed-off-by: Adam Manzanares <adam.manzana...@wdc.com>

Reviewed-by: Jeff Moyer <jmo...@redhat.com>


> ---
>  fs/iomap.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index afd163586aa0..65aae194aeca 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -919,6 +919,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t 
> length,
>   bio->bi_iter.bi_sector =
>   (iomap->addr + pos - iomap->offset) >> 9;
>   bio->bi_write_hint = dio->iocb->ki_hint;
> + bio->bi_ioprio = dio->iocb->ki_ioprio;
>   bio->bi_private = dio;
>   bio->bi_end_io = iomap_dio_bio_end_io;



Re: [PATCH v5 4/5] fs: blkdev set bio prio from kiocb prio

2018-05-21 Thread Jeff Moyer
adam.manzana...@wdc.com writes:

> From: Adam Manzanares <adam.manzana...@wdc.com>
>
> Now that kiocb has an ioprio field copy this over to the bio when it is
> created from the kiocb.
>
> Signed-off-by: Adam Manzanares <adam.manzana...@wdc.com>

Reviewed-by: Jeff Moyer <jmo...@redhat.com>


> ---
>  fs/block_dev.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 7ec920e27065..da1e94d2bb75 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -355,6 +355,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, int nr_pages)
>   bio->bi_write_hint = iocb->ki_hint;
>   bio->bi_private = dio;
>   bio->bi_end_io = blkdev_bio_end_io;
> + bio->bi_ioprio = iocb->ki_ioprio;
>  
>   ret = bio_iov_iter_get_pages(bio, iter);
>   if (unlikely(ret)) {



Re: [PATCH v5 1/5] block: add ioprio_check_cap function

2018-05-21 Thread Jeff Moyer
adam.manzana...@wdc.com writes:

> From: Adam Manzanares <adam.manzana...@wdc.com>
>
> Aio per command iopriority support introduces a second interface between
> userland and the kernel capable of passing iopriority. The aio interface also
> needs the ability to verify that the submitting context has sufficient
> priviledges to submit IOPRIO_RT commands. This patch creates the
> ioprio_check_cap function to be used by the ioprio_set system call and also by
> the aio interface.
>
> Signed-off-by: Adam Manzanares <adam.manzana...@wdc.com>
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Reviewed-by: Jeff Moyer <jmo...@redhat.com>


> ---
>  block/ioprio.c | 22 --
>  include/linux/ioprio.h |  2 ++
>  2 files changed, 18 insertions(+), 6 deletions(-)
>
> diff --git a/block/ioprio.c b/block/ioprio.c
> index 6f5d0b6625e3..f9821080c92c 100644
> --- a/block/ioprio.c
> +++ b/block/ioprio.c
> @@ -61,15 +61,10 @@ int set_task_ioprio(struct task_struct *task, int ioprio)
>  }
>  EXPORT_SYMBOL_GPL(set_task_ioprio);
>  
> -SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
> +int ioprio_check_cap(int ioprio)
>  {
>   int class = IOPRIO_PRIO_CLASS(ioprio);
>   int data = IOPRIO_PRIO_DATA(ioprio);
> - struct task_struct *p, *g;
> - struct user_struct *user;
> - struct pid *pgrp;
> - kuid_t uid;
> - int ret;
>  
>   switch (class) {
>   case IOPRIO_CLASS_RT:
> @@ -92,6 +87,21 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, 
> ioprio)
>   return -EINVAL;
>   }
>  
> + return 0;
> +}
> +
> +SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
> +{
> + struct task_struct *p, *g;
> + struct user_struct *user;
> + struct pid *pgrp;
> + kuid_t uid;
> + int ret;
> +
> + ret = ioprio_check_cap(ioprio);
> + if (ret)
> + return ret;
> +
>   ret = -ESRCH;
>   rcu_read_lock();
>   switch (which) {
> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> index 627efac73e6d..4a28cec49ec3 100644
> --- a/include/linux/ioprio.h
> +++ b/include/linux/ioprio.h
> @@ -77,4 +77,6 @@ extern int ioprio_best(unsigned short aprio, unsigned short 
> bprio);
>  
>  extern int set_task_ioprio(struct task_struct *task, int ioprio);
>  
> +extern int ioprio_check_cap(int ioprio);
> +
>  #endif



Re: [PATCH v5 0/5] AIO add per-command iopriority

2018-05-21 Thread Jeff Moyer
Hi, Adam,

adam.manzana...@wdc.com writes:

> From: Adam Manzanares 
>
> This is the per-I/O equivalent of the ioprio_set system call.
> See the following link for performance implications on a SATA HDD:
> https://lkml.org/lkml/2016/12/6/495
>
> First patch factors ioprio_check_cap function out of ioprio_set system call to
> also be used by the aio ioprio interface.
>
> Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat.
>
> Third patch passes ioprio hint from aio iocb to kiocb and initializes kiocb
> ioprio value appropriately when it is not explicitly set.
>
> Fourth patch enables the feature for blkdev.
>
> Fifth patch enables the feature for iomap direct IO
>
> Note: this work is based on top of linux-vfs/for-next

I'll cook up a libaio test case.  Can you put together a man-pages
update for this?

Thanks!
Jeff



Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Jeff Moyer
Dan Williams <dan.j.willi...@intel.com> writes:

> On Mon, May 7, 2018 at 12:08 PM, Jeff Moyer <jmo...@redhat.com> wrote:
>> Dan Williams <dan.j.willi...@intel.com> writes:
>>
>>> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox <wi...@infradead.org> wrote:
>>>> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
>>>>> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
>>>>> DEVICE zone, which is a virtual zone and both its start and end of pfn
>>>>> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
>>>>> corresponding drivers, which locate at \drivers\nvdimm\ and
>>>>> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
>>>>> memory hot plug implementation.
>>>>
>>>> You probably want to let linux-nvdimm know about this patch set.
>>>> Adding to the cc.
>>>
>>> Yes, thanks for that!
>>>
>>>> Also, I only received patch 0 and 4.  What happened
>>>> to 1-3,5 and 6?
>>>>
>>>>> With current kernel, many mm’s classical features like the buddy
>>>>> system, swap mechanism and page cache couldn’t be supported to NVDIMM.
>>>>> What we are doing is to expand kernel mm’s capacity to make it to handle
>>>>> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
>>>>> separately, that means mm can only put the critical pages to NVDIMM
>>
>> Please define "critical pages."
>>
>>>>> zone, here we created a new zone type as NVM zone. That is to say for
>>>>> traditional(or normal) pages which would be stored at DRAM scope like
>>>>> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
>>>>> them could be recovered from power fail or system crash, we make them
>>>>> to be persistent by storing them to NVM zone.
>>
>> [...]
>>
>>> I think adding yet one more mm-zone is the wrong direction. Instead,
>>> what we have been considering is a mechanism to allow a device-dax
>>> instance to be given back to the kernel as a distinct numa node
>>> managed by the VM. It seems it is time to dust off those patches.
>>
>> What's the use case?
>
> Use NVDIMMs as System-RAM given their potentially higher capacity than
> DDR. The expectation in that case is that data is forfeit (not
> persisted) after a crash. Any persistent use case would need to go
> through the pmem driver, filesystem-dax or device-dax.

OK, but that sounds different from what was being proposed, here.  I'll
quote from above:

>>>>> But for the critical pages, which we hope them could be recovered
  ^
>>>>> from power fail or system crash, we make them to be persistent by
  ^^^
>>>>> storing them to NVM zone.

Hence my confusion.

Cheers,
Jeff



Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone

2018-05-07 Thread Jeff Moyer
Dan Williams  writes:

> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox  wrote:
>> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
>>> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
>>> DEVICE zone, which is a virtual zone and both its start and end of pfn
>>> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
>>> corresponding drivers, which locate at \drivers\nvdimm\ and
>>> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
>>> memory hot plug implementation.
>>
>> You probably want to let linux-nvdimm know about this patch set.
>> Adding to the cc.
>
> Yes, thanks for that!
>
>> Also, I only received patch 0 and 4.  What happened
>> to 1-3,5 and 6?
>>
>>> With current kernel, many mm’s classical features like the buddy
>>> system, swap mechanism and page cache couldn’t be supported to NVDIMM.
>>> What we are doing is to expand kernel mm’s capacity to make it to handle
>>> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
>>> separately, that means mm can only put the critical pages to NVDIMM

Please define "critical pages."

>>> zone, here we created a new zone type as NVM zone. That is to say for
>>> traditional(or normal) pages which would be stored at DRAM scope like
>>> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
>>> them could be recovered from power fail or system crash, we make them
>>> to be persistent by storing them to NVM zone.

[...]

> I think adding yet one more mm-zone is the wrong direction. Instead,
> what we have been considering is a mechanism to allow a device-dax
> instance to be given back to the kernel as a distinct numa node
> managed by the VM. It seems it is time to dust off those patches.

What's the use case?  The above patch description seems to indicate an
intent to recover contents after a power loss.  Without seeing the whole
series, I'm not sure how that's accomplished in a safe or meaningful
way.

Huaisheng, could you provide a bit more background?

Thanks!
Jeff



Re: [PATCH v2] fs: Add aio iopriority support for block_dev

2018-05-03 Thread Jeff Moyer
Hi, Adam,

adam.manzana...@wdc.com writes:

> From: Adam Manzanares 
>
> This is the per-I/O equivalent of the ioprio_set system call.
Re: [PATCH v2] fs: Add aio iopriority support for block_dev

2018-05-03 Thread Jeff Moyer
Hi, Adam,

adam.manzana...@wdc.com writes:

> From: Adam Manzanares 
>
> This is the per-I/O equivalent of the ioprio_set system call.
>
> When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the
> newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field.
>
> When a bio is created for an aio request by the block dev we set the priority
> value of the bio to the user supplied value.
>
> See the following link for performance implications on a SATA HDD:
> https://lkml.org/lkml/2016/12/6/495
>
> Given that WRR support for NVME devices has patches floating around and it was
> discussed at LSFMM, we may soon have a lower latency storage device that can 
> make use of iopriorities. A per command iopriority interface seems timely 
> given these developments. 
>
> If we want to avoid bloating struct kiocb, I suggest we turn the private 
> field 
> into a union of the private and ki_ioprio field. It seems like the users of 
> the private field all use it at a point where we can yank the priority from 
> the kiocb before the private field is used. Comments and suggestions welcome.

The ioprio_set system call requires CAP_SYS_ADMIN for setting
IOPRIO_CLASS_RT.  I think we need similar checks here.

-Jeff

>
> v2: merge patches
> use IOCB_FLAG_IOPRIO
> validate intended use with IOCB_IOPRIO
> add linux-api and linux-block to cc
>
> Signed-off-by: Adam Manzanares 
> ---
>  fs/aio.c | 10 ++
>  fs/block_dev.c   |  2 ++
>  include/linux/fs.h   |  2 ++
>  include/uapi/linux/aio_abi.h |  1 +
>  4 files changed, 15 insertions(+)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 88d7927ffbc6..f36636d8ff2c 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1597,6 +1597,16 @@ static int io_submit_one(struct kioctx *ctx, struct 
> iocb __user *user_iocb,
>   req->common.ki_flags |= IOCB_EVENTFD;
>   }
>  
> + if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
> + /*
> +  * If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
> +  * aio_reqprio is interpreted as an I/O scheduling
> +  * class and priority.
> +  */
> + req->common.ki_ioprio = iocb->aio_reqprio;
> + req->common.ki_flags |= IOCB_IOPRIO;
> + }
> +
>   ret = kiocb_set_rw_flags(&req->common, iocb->aio_rw_flags);
>   if (unlikely(ret)) {
>   pr_debug("EINVAL: aio_rw_flags\n");
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 7ec920e27065..970bef79caa6 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -355,6 +355,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, int nr_pages)
>   bio->bi_write_hint = iocb->ki_hint;
>   bio->bi_private = dio;
>   bio->bi_end_io = blkdev_bio_end_io;
> + if (iocb->ki_flags & IOCB_IOPRIO)
> + bio->bi_ioprio = iocb->ki_ioprio;
>  
>   ret = bio_iov_iter_get_pages(bio, iter);
>   if (unlikely(ret)) {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 760d8da1b6c7..ab63ce720305 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -292,6 +292,7 @@ enum rw_hint {
>  #define IOCB_SYNC(1 << 5)
>  #define IOCB_WRITE   (1 << 6)
>  #define IOCB_NOWAIT  (1 << 7)
> +#define IOCB_IOPRIO  (1 << 8)
>  
>  struct kiocb {
>   struct file *ki_filp;
> @@ -300,6 +301,7 @@ struct kiocb {
>   void*private;
>   int ki_flags;
>   enum rw_hintki_hint;
> + u16 ki_ioprio; /* See linux/ioprio.h */
>  } __randomize_layout;
>  
>  static inline bool is_sync_kiocb(struct kiocb *kiocb)
> diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
> index a04adbc70ddf..d4593a6062ef 100644
> --- a/include/uapi/linux/aio_abi.h
> +++ b/include/uapi/linux/aio_abi.h
> @@ -54,6 +54,7 @@ enum {
>   *   is valid.
>   */
>  #define IOCB_FLAG_RESFD  (1 << 0)
> +#define IOCB_FLAG_IOPRIO (1 << 1)
>  
>  /* read() from /dev/aio returns these structures. */
>  struct io_event {


Re: [PATCH] pmem: fix badblocks population for raw mode

2018-05-02 Thread Jeff Moyer
Toshi Kani <toshi.k...@hpe.com> writes:

> pmem_attach_disk() calls nvdimm_badblocks_populate() with resource
> range uninitialized in the case of raw mode.  This leads the pmem
> driver to hit an MCE despite ARS reporting the range bad.
>
> Initialize 'bb_res' for raw mode.
>
> Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface to use 
> struct dev_pagemap")
> Signed-off-by: Toshi Kani <toshi.k...@hpe.com>
> Cc: Christoph Hellwig <h...@lst.de>
> Cc: Dan Williams <dan.j.willi...@intel.com>
> Cc: <sta...@vger.kernel.org>

Reviewed-by: Jeff Moyer <jmo...@redhat.com>

> ---
>  drivers/nvdimm/pmem.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 9d714926ecf5..2d7875209bce 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -367,9 +367,11 @@ static int pmem_attach_disk(struct device *dev,
>   addr = devm_memremap_pages(dev, &pmem->pgmap);
>   pmem->pfn_flags |= PFN_MAP;
>   memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res));
> - } else
> + } else {
>   addr = devm_memremap(dev, pmem->phys_addr,
>   pmem->size, ARCH_MEMREMAP_PMEM);
> + memcpy(&bb_res, res, sizeof(bb_res));
> + }
>  
>   /*
>* At release time the queue must be frozen before
> ___
> Linux-nvdimm mailing list
> linux-nvd...@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm



Re: [Qemu-devel] [RFC v2 1/2] virtio: add pmem driver

2018-04-26 Thread Jeff Moyer
Pankaj Gupta  writes:

>> Ideally, qemu (seabios?) would advertise a platform capabilities
>> sub-table that doesn't fill in the flush bits.
>
> Could you please elaborate on this, how its related to disabling
> MAP_SYNC? We are not doing entire nvdimm device emulation. 

My mistake.  If you're not providing an NFIT, then you can ignore this
comment.  I'll have a closer look at your patches next week.

-Jeff



Re: [RFC v2 1/2] virtio: add pmem driver

2018-04-26 Thread Jeff Moyer
Dan Williams  writes:

> [ adding Jeff directly since he has also been looking at
> infrastructure to track when MAP_SYNC should be disabled ]
>
> On Wed, Apr 25, 2018 at 7:21 AM, Dan Williams  
> wrote:
>> On Wed, Apr 25, 2018 at 4:24 AM, Pankaj Gupta  wrote:
>>> This patch adds virtio-pmem driver for KVM
>>> guest.
>>
>> Minor nit, please expand your changelog line wrapping to 72 columns.
>>
>>>
>>> Guest reads the persistent memory range
>>> information from Qemu over VIRTIO and registers
>>> it on nvdimm_bus. It also creates a nd_region
>>> object with the persistent memory range
>>> information so that existing 'nvdimm/pmem'
>>> driver can reserve this into system memory map.
>>> This way 'virtio-pmem' driver uses existing
>>> functionality of pmem driver to register persistent
>>> memory compatible for DAX capable filesystems.
>>
>> We need some additional enabling to disable MAP_SYNC for this

enable to disable... I like it!  ;-)

>> configuration. In other words, if fsync() is required then we must
>> disable the MAP_SYNC optimization. I think this should be a struct
>> dax_device property looked up at mmap time in each MAP_SYNC capable
>> ->mmap() file operation implementation.

Ideally, qemu (seabios?) would advertise a platform capabilities
sub-table that doesn't fill in the flush bits.

-Jeff



Re: io_pgetevents & aio fsync V2

2018-04-06 Thread Jeff Moyer
Christoph Hellwig  writes:

> On Fri, Apr 06, 2018 at 04:16:30AM +0100, Al Viro wrote:
>> BTW, this is only tangentially related, but... does *anything* call
>> io_submit() for huge amounts of iocb?

I don't know.  If an application did that, as many I/Os as could fit
into the ring buffer would be submitted, and that's what gets returned
from the system call (the number of submitted iocbs).

>> Check in do_io_submit() is insane - "no more than MAX_LONG total of
>> _pointers_".  Compat variant goes for "no more than a page worth of
>> pointers" and there's a hard limit in ioctx_alloc() - we can't ever
>> get more than 8M slots in ring buffer...
>
> Logical upper bound for io_submit is nr_events passed to io_setup(),
> which is bound by aio_max_nr.  Except that we never actually check
> against nr_events (or max_reqs as it is known in kernel) in io_submit.
> Sigh..

io_submit_one calls aio_get_req which calls get_reqs_available, which is
what does the checking for an available ring buffer entry.

-Jeff



Re: [PATCH 08/36] aio: implement io_pgetevents

2018-03-20 Thread Jeff Moyer
Christoph Hellwig <h...@lst.de> writes:

> On Tue, Mar 20, 2018 at 11:30:29AM -0400, Jeff Moyer wrote:
>> > In this commit:
>> >
>> > http://git.infradead.org/users/hch/libaio.git/commitdiff/49f77d595210393ce7b125cb00233cf737402f56
>> 
>> BTW, the man pages are shipped in the man-pages package, now.
>> Christoph, I can forward the change to Michael after this series goes
>> in.  Just let me know what's easiest for you.
>
> I'd be more than happy to send patch to Michael if we get these
> patches merged finally :)

hahaha, least of your problems, I know.  :)



Re: [PATCH 08/36] aio: implement io_pgetevents

2018-03-20 Thread Jeff Moyer
Christoph Hellwig  writes:

> On Mon, Mar 19, 2018 at 07:12:41PM -0700, Darrick J. Wong wrote:
>> > Note that unlike many other signal related calls we do not pass a sigmask
>> > size, as that would get us to 7 arguments, which aren't easily supported
>> > by the syscall infrastructure.  It seems a lot less painful to just add a
>> > new syscall variant in the unlikely case we're going to increase the
>> > sigset size.
>> 
>> I'm assuming there's a proposed manpage update for this somewhere? :)
>
> In this commit:
>
> http://git.infradead.org/users/hch/libaio.git/commitdiff/49f77d595210393ce7b125cb00233cf737402f56

BTW, the man pages are shipped in the man-pages package, now.
Christoph, I can forward the change to Michael after this series goes
in.  Just let me know what's easiest for you.

Cheers,
Jeff



Re: [PATCH 14/36] aio: implement IOCB_CMD_POLL

2018-03-05 Thread Jeff Moyer
Christoph Hellwig <h...@lst.de> writes:

> Simple one-shot poll through the io_submit() interface.  To poll for
> a file descriptor the application should submit an iocb of type
> IOCB_CMD_POLL.  It will poll the fd for the events specified in the
> first 32 bits of the aio_buf field of the iocb.
>
> Unlike poll or epoll without EPOLLONESHOT this interface always works
> in one shot mode, that is once the iocb is completed, it will have to be
> resubmitted.
>
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Also acked this one in the last posting.

Acked-by: Jeff Moyer <jmo...@redhat.com>


> ---
>  fs/aio.c | 102 
> +++
>  include/uapi/linux/aio_abi.h |   6 +--
>  2 files changed, 104 insertions(+), 4 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index da87cbf7c67a..0bafc4975d51 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -5,6 +5,7 @@
>   *   Implements an efficient asynchronous io interface.
>   *
>   *   Copyright 2000, 2001, 2002 Red Hat, Inc.  All Rights Reserved.
> + *   Copyright 2018 Christoph Hellwig.
>   *
>   *   See ../COPYING for licensing terms.
>   */
> @@ -156,9 +157,17 @@ struct kioctx {
>   unsignedid;
>  };
>  
> +struct poll_iocb {
> + struct file *file;
> + __poll_tevents;
> + struct wait_queue_head  *head;
> + struct wait_queue_entry wait;
> +};
> +
>  struct aio_kiocb {
>   union {
>   struct kiocbrw;
> + struct poll_iocbpoll;
>   };
>  
>   struct kioctx   *ki_ctx;
> @@ -1565,6 +1574,96 @@ static ssize_t aio_write(struct kiocb *req, struct 
> iocb *iocb, bool vectored,
>   return ret;
>  }
>  
> +static void __aio_complete_poll(struct poll_iocb *req, __poll_t mask)
> +{
> + fput(req->file);
> + aio_complete(container_of(req, struct aio_kiocb, poll),
> + mangle_poll(mask), 0);
> +}
> +
> +static void aio_complete_poll(struct poll_iocb *req, __poll_t mask)
> +{
> + struct aio_kiocb *iocb = container_of(req, struct aio_kiocb, poll);
> +
> + if (!(iocb->flags & AIO_IOCB_CANCELLED))
> + __aio_complete_poll(req, mask);
> +}
> +
> +static int aio_poll_cancel(struct kiocb *rw)
> +{
> + struct aio_kiocb *iocb = container_of(rw, struct aio_kiocb, rw);
> +
> + remove_wait_queue(iocb->poll.head, &iocb->poll.wait);
> + __aio_complete_poll(&iocb->poll, 0); /* no events to report */
> + return 0;
> +}
> +
> +static int aio_poll_wake(struct wait_queue_entry *wait, unsigned mode, int 
> sync,
> + void *key)
> +{
> + struct poll_iocb *req = container_of(wait, struct poll_iocb, wait);
> + struct file *file = req->file;
> + __poll_t mask = key_to_poll(key);
> +
> + assert_spin_locked(&req->head->lock);
> +
> + /* for instances that support it check for an event match first: */
> + if (mask && !(mask & req->events))
> + return 0;
> +
> + mask = vfs_poll_mask(file, req->events);
> + if (!mask)
> + return 0;
> +
> + __remove_wait_queue(req->head, &req->wait);
> + aio_complete_poll(req, mask);
> + return 1;
> +}
> +
> +static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb)
> +{
> + struct poll_iocb *req = &aiocb->poll;
> + unsigned long flags;
> + __poll_t mask;
> +
> + /* reject any unknown events outside the normal event mask. */
> + if ((u16)iocb->aio_buf != iocb->aio_buf)
> + return -EINVAL;
> + /* reject fields that are not defined for poll */
> + if (iocb->aio_offset || iocb->aio_nbytes || iocb->aio_rw_flags)
> + return -EINVAL;
> +
> + req->events = demangle_poll(iocb->aio_buf) | POLLERR | POLLHUP;
> + req->file = fget(iocb->aio_fildes);
> + if (unlikely(!req->file))
> + return -EBADF;
> +
> + req->head = vfs_get_poll_head(req->file, req->events);
> + if (!req->head) {
> + fput(req->file);
> + return -EINVAL; /* same as no support for IOCB_CMD_POLL */
> + }
> + if (IS_ERR(req->head)) {
> + mask = PTR_TO_POLL(req->head);
> + goto done;
> + }
> +
> + init_waitqueue_func_entry(&req->wait, aio_poll_wake);
> +
> + spin_lock_irqsave(&req->head->lock, flags);
> + mask = vfs_poll_mask(req->file, req->events);
> + if (!mask) {
> + __kiocb_set_cancel_fn(aiocb, aio_poll_cancel,
> +   

Re: [PATCH 08/36] aio: implement io_pgetevents

2018-03-05 Thread Jeff Moyer
Christoph Hellwig <h...@lst.de> writes:

> This is the io_getevents equivalent of ppoll/pselect and allows to
> properly mix signals and aio completions (especially with IOCB_CMD_POLL)
> and atomically executes the following sequence:
>
>   sigset_t origmask;
>
>   pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
>   ret = io_getevents(ctx, min_nr, nr, events, timeout);
>   pthread_sigmask(SIG_SETMASK, &origmask, NULL);
>
> Note that unlike many other signal related calls we do not pass a sigmask
> size, as that would get us to 7 arguments, which aren't easily supported
> by the syscall infrastructure.  It seems a lot less painful to just add a
> new syscall variant in the unlikely case we're going to increase the
> sigset size.
>
> Signed-off-by: Christoph Hellwig <h...@lst.de>

I acked this in the last set, so...

Acked-by: Jeff Moyer <jmo...@redhat.com>

> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/aio.c   | 114 
> ++---
>  include/linux/compat.h |   7 ++
>  include/linux/syscalls.h   |   6 ++
>  include/uapi/asm-generic/unistd.h  |   4 +-
>  include/uapi/linux/aio_abi.h   |   6 ++
>  kernel/sys_ni.c|   2 +
>  8 files changed, 130 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 448ac2161112..5997c3e9ac3e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -391,3 +391,4 @@
>  382  i386pkey_free   sys_pkey_free
>  383  i386statx   sys_statx
>  384  i386arch_prctl  sys_arch_prctl  
> compat_sys_arch_prctl
> +385  i386io_pgetevents   sys_io_pgetevents   
> compat_sys_io_pgetevents
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5aef183e2f85..e995cd2b4e65 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -339,6 +339,7 @@
>  330  common  pkey_alloc  sys_pkey_alloc
>  331  common  pkey_free   sys_pkey_free
>  332  common  statx   sys_statx
> +333  common  io_pgetevents   sys_io_pgetevents
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/aio.c b/fs/aio.c
> index 9d7d6e4cde87..da87cbf7c67a 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1291,10 +1291,6 @@ static long read_events(struct kioctx *ctx, long 
> min_nr, long nr,
>   wait_event_interruptible_hrtimeout(ctx->wait,
>   aio_read_events(ctx, min_nr, nr, event, &ret),
>   until);
> -
> - if (!ret && signal_pending(current))
> - ret = -EINTR;
> -
>   return ret;
>  }
>  
> @@ -1874,13 +1870,60 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
>   struct timespec __user *, timeout)
>  {
>   struct timespec64   ts;
> + int ret;
> +
> + if (timeout && unlikely(get_timespec64(&ts, timeout)))
> + return -EFAULT;
> +
> + ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
> + if (!ret && signal_pending(current))
> + ret = -EINTR;
> + return ret;
> +}
> +
> +SYSCALL_DEFINE6(io_pgetevents,
> + aio_context_t, ctx_id,
> + long, min_nr,
> + long, nr,
> + struct io_event __user *, events,
> + struct timespec __user *, timeout,
> + const struct __aio_sigset __user *, usig)
> +{
> + struct __aio_sigset ksig = { NULL, };
> + sigset_tksigmask, sigsaved;
> + struct timespec64   ts;
> + int ret;
> +
> + if (timeout && unlikely(get_timespec64(&ts, timeout)))
> + return -EFAULT;
>  
> - if (timeout) {
> - if (unlikely(get_timespec64(, timeout)))
> + if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
> + return -EFAULT;
> +
> + if (ksig.sigmask) {
> + if (ksig.sigsetsize != sizeof(sigset_t))
> + return -EINVAL;
> + if (copy_from_user(&ksigmask, ksig.sigmask, sizeof(ksigmask)))
>   return -EFAULT;
> + sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> + sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved);
> + }
> +
> + ret = do_io_getevents(c

Re: [PATCH 08/36] aio: implement io_pgetevents

2018-03-05 Thread Jeff Moyer
Christoph Hellwig  writes:

> This is the io_getevents equivalent of ppoll/pselect and allows to
> properly mix signals and aio completions (especially with IOCB_CMD_POLL)
> and atomically executes the following sequence:
>
>   sigset_t origmask;
>
>   pthread_sigmask(SIG_SETMASK, , );
>   ret = io_getevents(ctx, min_nr, nr, events, timeout);
>   pthread_sigmask(SIG_SETMASK, , NULL);
>
> Note that unlike many other signal related calls we do not pass a sigmask
> size, as that would get us to 7 arguments, which aren't easily supported
> by the syscall infrastructure.  It seems a lot less painful to just add a
> new syscall variant in the unlikely case we're going to increase the
> sigset size.
>
> Signed-off-by: Christoph Hellwig 

I acked this in the last set, so...

Acked-by: Jeff Moyer 

> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/aio.c   | 114 
> ++---
>  include/linux/compat.h |   7 ++
>  include/linux/syscalls.h   |   6 ++
>  include/uapi/asm-generic/unistd.h  |   4 +-
>  include/uapi/linux/aio_abi.h   |   6 ++
>  kernel/sys_ni.c|   2 +
>  8 files changed, 130 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 448ac2161112..5997c3e9ac3e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -391,3 +391,4 @@
>  382  i386pkey_free   sys_pkey_free
>  383  i386statx   sys_statx
>  384  i386arch_prctl  sys_arch_prctl  
> compat_sys_arch_prctl
> +385  i386io_pgetevents   sys_io_pgetevents   
> compat_sys_io_pgetevents
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5aef183e2f85..e995cd2b4e65 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -339,6 +339,7 @@
>  330  common  pkey_alloc  sys_pkey_alloc
>  331  common  pkey_free   sys_pkey_free
>  332  common  statx   sys_statx
> +333  common  io_pgetevents   sys_io_pgetevents
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/aio.c b/fs/aio.c
> index 9d7d6e4cde87..da87cbf7c67a 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1291,10 +1291,6 @@ static long read_events(struct kioctx *ctx, long 
> min_nr, long nr,
>   wait_event_interruptible_hrtimeout(ctx->wait,
>   aio_read_events(ctx, min_nr, nr, event, ),
>   until);
> -
> - if (!ret && signal_pending(current))
> - ret = -EINTR;
> -
>   return ret;
>  }
>  
> @@ -1874,13 +1870,60 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
>   struct timespec __user *, timeout)
>  {
>   struct timespec64   ts;
> + int ret;
> +
> + if (timeout && unlikely(get_timespec64(&ts, timeout)))
> + return -EFAULT;
> +
> + ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
> + if (!ret && signal_pending(current))
> + ret = -EINTR;
> + return ret;
> +}
> +
> +SYSCALL_DEFINE6(io_pgetevents,
> + aio_context_t, ctx_id,
> + long, min_nr,
> + long, nr,
> + struct io_event __user *, events,
> + struct timespec __user *, timeout,
> + const struct __aio_sigset __user *, usig)
> +{
> + struct __aio_sigset ksig = { NULL, };
> + sigset_t ksigmask, sigsaved;
> + struct timespec64   ts;
> + int ret;
> +
> + if (timeout && unlikely(get_timespec64(&ts, timeout)))
> + return -EFAULT;
>  
> - if (timeout) {
> - if (unlikely(get_timespec64(&ts, timeout)))
> + if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
> + return -EFAULT;
> +
> + if (ksig.sigmask) {
> + if (ksig.sigsetsize != sizeof(sigset_t))
> + return -EINVAL;
> + if (copy_from_user(&ksigmask, ksig.sigmask, sizeof(ksigmask)))
>   return -EFAULT;
> + sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> + sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved);
> + }
> +
> + ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
> + if (signal_

Re: [PATCH v2] libnvdimm: re-enable deep flush for pmem devices

2018-02-13 Thread Jeff Moyer
Dan Williams <dan.j.willi...@intel.com> writes:

> On Mon, Feb 12, 2018 at 2:53 PM, Jeff Moyer <jmo...@redhat.com> wrote:
>> Dave Jiang <dave.ji...@intel.com> writes:
>>
>>> Re-enable deep flush so that users always have a way to be sure that a write
>>> does make it all the way out to the NVDIMM. The PMEM driver writes always
>>> make it "all the way to the NVDIMM", and it relies on the ADR mechanism to
>>> flush the write buffers on power failure. Deep flush is there to explicitly
>>> flush those write buffers to protect against (rare) ADR failure.
>>> This change prevents a regression in deep flush behavior so that 
>>> applications
>>> can continue to depend on fsync() as a mechanism to trigger deep flush in 
>>> the
>>> filesystem-dax case.
>>
>> That's still very confusing text.  Specifically, the part where you say
>> that pmem driver writes always make it to the DIMM.  I think the
>> changelog could start with "Deep flush is there to explicitly flush
>> write buffers"  Anyway, the fix looks right to me.
>
> I ended up changing the commit message to this, let me know if it reads 
> better:

Thanks.  It's still unclear to me what the text, "The PMEM driver writes
always arrive at the NVDIMM" means.  However, it's good enough.

Thanks!
Jeff

>
> libnvdimm: re-enable deep flush for pmem devices via fsync()
>
> Re-enable deep flush so that users always have a way to be sure that a
> write makes it all the way out to media. The PMEM driver writes always
> arrive at the NVDIMM, and it relies on the ADR (Asynchronous DRAM
> Refresh) mechanism to flush the write buffers on power failure. Deep
> flush is there to explicitly flush those write buffers to protect
> against (rare) ADR failure.  This change prevents a regression in deep
> flush behavior so that applications can continue to depend on fsync() as
> a mechanism to trigger deep flush in the filesystem-DAX case.
>
> Fixes: 06e8ccdab15f4 ("acpi: nfit: Add support for detect platform
> CPU cache...")
> Signed-off-by: Dave Jiang <dave.ji...@intel.com>
> Signed-off-by: Dan Williams <dan.j.willi...@intel.com>



Re: [PATCH v2] libnvdimm: re-enable deep flush for pmem devices

2018-02-12 Thread Jeff Moyer
Dave Jiang <dave.ji...@intel.com> writes:

> Re-enable deep flush so that users always have a way to be sure that a write
> does make it all the way out to the NVDIMM. The PMEM driver writes always
> make it "all the way to the NVDIMM", and it relies on the ADR mechanism to
> flush the write buffers on power failure. Deep flush is there to explicitly
> flush those write buffers to protect against (rare) ADR failure.
> This change prevents a regression in deep flush behavior so that applications
> can continue to depend on fsync() as a mechanism to trigger deep flush in the
> filesystem-dax case.

That's still very confusing text.  Specifically, the part where you say
that pmem driver writes always make it to the DIMM.  I think the
changelog could start with "Deep flush is there to explicitly flush
write buffers"  Anyway, the fix looks right to me.

Reviewed-by: Jeff Moyer <jmo...@redhat.com>


Re: [PATCH 08/36] aio: implement io_pgetevents

2018-01-24 Thread Jeff Moyer
Christoph Hellwig <h...@lst.de> writes:

> This is the io_getevents equivalent of ppoll/pselect and allows to
> properly mix signals and aio completions (especially with IOCB_CMD_POLL)
> and atomically executes the following sequence:
>
>   sigset_t origmask;
>
>   pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
>   ret = io_getevents(ctx, min_nr, nr, events, timeout);
>   pthread_sigmask(SIG_SETMASK, &origmask, NULL);
>
> Note that unlike many other signal related calls we do not pass a sigmask
> size, as that would get us to 7 arguments, which aren't easily supported
> by the syscall infrastructure.  It seems a lot less painful to just add a
> new syscall variant in the unlikely case we're going to increase the
> sigset size.
>
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Acked-by: Jeff Moyer <jmo...@redhat.com>

I assume other arch maintainers will also plumb this in?

-Jeff

> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/aio.c   | 114 
> ++---
>  include/linux/compat.h |   7 ++
>  include/linux/syscalls.h   |   6 ++
>  include/uapi/asm-generic/unistd.h  |   4 +-
>  include/uapi/linux/aio_abi.h   |   6 ++
>  kernel/sys_ni.c|   2 +
>  8 files changed, 130 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index 448ac2161112..5997c3e9ac3e 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -391,3 +391,4 @@
>  382  i386pkey_free   sys_pkey_free
>  383  i386statx   sys_statx
>  384  i386arch_prctl  sys_arch_prctl  
> compat_sys_arch_prctl
> +385  i386io_pgetevents   sys_io_pgetevents   
> compat_sys_io_pgetevents
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5aef183e2f85..e995cd2b4e65 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -339,6 +339,7 @@
>  330  common  pkey_alloc  sys_pkey_alloc
>  331  common  pkey_free   sys_pkey_free
>  332  common  statx   sys_statx
> +333  common  io_pgetevents   sys_io_pgetevents
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/aio.c b/fs/aio.c
> index 9d7d6e4cde87..da87cbf7c67a 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1291,10 +1291,6 @@ static long read_events(struct kioctx *ctx, long 
> min_nr, long nr,
>   wait_event_interruptible_hrtimeout(ctx->wait,
>   aio_read_events(ctx, min_nr, nr, event, &ret),
>   until);
> -
> - if (!ret && signal_pending(current))
> - ret = -EINTR;
> -
>   return ret;
>  }
>  
> @@ -1874,13 +1870,60 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
>   struct timespec __user *, timeout)
>  {
>   struct timespec64   ts;
> + int ret;
> +
> + if (timeout && unlikely(get_timespec64(&ts, timeout)))
> + return -EFAULT;
> +
> + ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
> + if (!ret && signal_pending(current))
> + ret = -EINTR;
> + return ret;
> +}
> +
> +SYSCALL_DEFINE6(io_pgetevents,
> + aio_context_t, ctx_id,
> + long, min_nr,
> + long, nr,
> + struct io_event __user *, events,
> + struct timespec __user *, timeout,
> + const struct __aio_sigset __user *, usig)
> +{
> + struct __aio_sigset ksig = { NULL, };
> + sigset_t ksigmask, sigsaved;
> + struct timespec64   ts;
> + int ret;
> +
> + if (timeout && unlikely(get_timespec64(&ts, timeout)))
> + return -EFAULT;
>  
> - if (timeout) {
> - if (unlikely(get_timespec64(&ts, timeout)))
> + if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
> + return -EFAULT;
> +
> + if (ksig.sigmask) {
> + if (ksig.sigsetsize != sizeof(sigset_t))
> + return -EINVAL;
> + if (copy_from_user(&ksigmask, ksig.sigmask, sizeof(ksigmask)))
>   return -EFAULT;
> + sigdelsetmask(&ksigmask, sigmask(SIGKILL) | sigmask(SIGSTOP));
> + sigprocmask(SIG_SETMASK, &ksigmask, &sigsaved);
> + }
>


Re: [PATCH 14/36] aio: implement IOCB_CMD_POLL

2018-01-24 Thread Jeff Moyer
Christoph Hellwig <h...@lst.de> writes:

> Simple one-shot poll through the io_submit() interface.  To poll for
> a file descriptor the application should submit an iocb of type
> IOCB_CMD_POLL.  It will poll the fd for the events specified in the
> first 32 bits of the aio_buf field of the iocb.
>
> Unlike poll or epoll without EPOLLONESHOT this interface always works
> in one shot mode, that is once the iocb is completed, it will have to be
> resubmitted.
>
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Acked-by: Jeff Moyer <jmo...@redhat.com>


Re: aio poll, io_pgetevents and a new in-kernel poll API V3

2018-01-18 Thread Jeff Moyer
Christoph Hellwig <h...@lst.de> writes:

> On Thu, Jan 18, 2018 at 11:44:03AM -0500, Jeff Moyer wrote:
>> Jeff Moyer <jmo...@redhat.com> writes:
>> 
>> > FYI, this kernel has issues.  It will boot up, but I don't have
>> > networking, and even rebooting doesn't succeed.  I'm looking into it.
>> 
>> A bisect lands on: eventfd: switch to ->poll_mask.  That's not super
>> helpful, though.  I did run the ltp eventfd2 tests, and they all pass.
>> 
>> The actual issue I get on boot is that several services don't start:
>>
>> [...]
>> 
>> Christoph, are you able to reproduce this?
>
> No, I can't reproduce any of that.  But I don't have a Fedora system
> either, so this might be a new systemd version doing funky things.
>
> The major change in this version was to call ->poll_mask before setting up
> the wait queue as well.  This does the right thing for poll and aio poll,
> but the more I dig into the epoll code the less sure I am it does the right
> thing for it, or in fact that epoll does the right thing in general..
>
> Do you still see it with the patch below applied?

No, that fixes it for me.

-Jeff

> diff --git a/fs/select.c b/fs/select.c
> index 707abe79536b..1784c1a29253 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -53,9 +53,9 @@ __poll_t vfs_poll(struct file *file, struct 
> poll_table_struct *pt)
>   head = vfs_get_poll_head(file, events);
>   if (!head)
>   return DEFAULT_POLLMASK;
> - mask = file->f_op->poll_mask(file, events);
> - if (mask)
> - return mask;
> +//   mask = file->f_op->poll_mask(file, events);
> +//   if (mask)
> +//   return mask;
>  
>   pt->_qproc(file, head, pt);
>   }


Re: aio poll, io_pgetevents and a new in-kernel poll API V3

2018-01-18 Thread Jeff Moyer
Avi Kivity <a...@scylladb.com> writes:

> On 01/18/2018 05:46 PM, Jeff Moyer wrote:
>> FYI, this kernel has issues.  It will boot up, but I don't have
>> networking, and even rebooting doesn't succeed.  I'm looking into it.
>
> FWIW, I'm running an older version of this patchset on my desktop with
> no problems so far.

Right, it's worth mentioning that v1 worked fine for me.

-Jeff

>> Christoph Hellwig <h...@lst.de> writes:
>>
>>> Hi all,
>>>
>>> this series adds support for the IOCB_CMD_POLL operation to poll for the
>>> readiness of file descriptors using the aio subsystem.  The API is based
>>> on patches that existed in RHAS2.1 and RHEL3, which means it already is
>>> supported by libaio.  To implement the poll support efficiently new
>>> methods to poll are introduced in struct file_operations:  get_poll_head
>>> and poll_mask.  The first one returns a wait_queue_head to wait on
>>> (lifetime is bound by the file), and the second does a non-blocking
>>> check for the POLL* events.  This allows aio poll to work without
>>> any additional context switches, unlike epoll.
>>>
>>> To make the interface fully useful a new io_pgetevents system call is
>>> added, which atomically saves and restores the signal mask over the
>>> io_pgetevents system call.  It is the logical equivalent to pselect and
>>> ppoll for io_pgetevents.
>>>
>>> The corresponding libaio changes for io_pgetevents support and
>>> documentation, as well as a test case will be posted in a separate
>>> series.
>>>
>>> The changes were sponsored by Scylladb, and improve performance
>>> of the seastar framework up to 10%, while also removing the need
>>> for a privileged SCHED_FIFO epoll listener thread.
>>>
>>> The patches are on top of Al's __poll_t annotations, so I've also
>>> prepared a git branch on top of those here:
>>>
>>>  git://git.infradead.org/users/hch/vfs.git aio-poll.3
>>>
>>> Gitweb:
>>>
>>>  
>>> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.3
>>>
>>> Libaio changes:
>>>
>>>  https://pagure.io/libaio.git io-poll
>>>
>>> Seastar changes (not updated for the new io_pgetevents ABI yet):
>>>
>>>  https://github.com/avikivity/seastar/commits/aio
>>>
>>> Changes since V2:
>>>   - removed a double initialization
>>>   - new vfs_get_poll_head helper
>>>   - document that ->get_poll_head can return NULL
>>>   - call ->poll_mask before sleeping
>>>   - various ACKs
>>>   - add conversion of random to ->poll_mask
>>>   - add conversion of af_alg to ->poll_mask
>>>   - lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL
>>>   - reshuffled the series so that prep patches and everything not
>>> requiring the new in-kernel poll API is in the beginning
>>>
>>> Changes since V1:
>>>   - handle the NULL ->poll case in vfs_poll
>>>   - dropped the file argument to the ->poll_mask socket operation
>>>   - replace the ->pre_poll socket operation with ->get_poll_head as
>>> in the file operations
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-aio' in
>>> the body to majord...@kvack.org.  For more info on Linux AIO,
>>> see: http://www.kvack.org/aio/
>>> Don't email: a...@kvack.org
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majord...@kvack.org.  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: a...@kvack.org


Re: aio poll, io_pgetevents and a new in-kernel poll API V3

2018-01-18 Thread Jeff Moyer
Jeff Moyer <jmo...@redhat.com> writes:

> FYI, this kernel has issues.  It will boot up, but I don't have
> networking, and even rebooting doesn't succeed.  I'm looking into it.

A bisect lands on: eventfd: switch to ->poll_mask.  That's not super
helpful, though.  I did run the ltp eventfd2 tests, and they all pass.

The actual issue I get on boot is that several services don't start:

[FAILED] Failed to start Modem Manager.
See 'systemctl status ModemManager.service' for details.
[FAILED] Failed to start Authorization Manager.
See 'systemctl status polkit.service' for details.
[DEPEND] Dependency failed for Dynamic System Tuning Daemon.
[FAILED] Failed to start Network Manager.
See 'systemctl status NetworkManager.service' for details.
[DEPEND] Dependency failed for Network Manager Wait Online.
...

Christoph, are you able to reproduce this?

-Jeff

> Christoph Hellwig <h...@lst.de> writes:
>
>> Hi all,
>>
>> this series adds support for the IOCB_CMD_POLL operation to poll for the
>> readiness of file descriptors using the aio subsystem.  The API is based
>> on patches that existed in RHAS2.1 and RHEL3, which means it already is
>> supported by libaio.  To implement the poll support efficiently new
>> methods to poll are introduced in struct file_operations:  get_poll_head
>> and poll_mask.  The first one returns a wait_queue_head to wait on
>> (lifetime is bound by the file), and the second does a non-blocking
>> check for the POLL* events.  This allows aio poll to work without
>> any additional context switches, unlike epoll.
>>
>> To make the interface fully useful a new io_pgetevents system call is
>> added, which atomically saves and restores the signal mask over the
>> io_pgetevents system call.  It is the logical equivalent to pselect and
>> ppoll for io_pgetevents.
>>
>> The corresponding libaio changes for io_pgetevents support and
>> documentation, as well as a test case will be posted in a separate
>> series.
>>
>> The changes were sponsored by Scylladb, and improve performance
>> of the seastar framework up to 10%, while also removing the need
>> for a privileged SCHED_FIFO epoll listener thread.
>>
>> The patches are on top of Al's __poll_t annotations, so I've also
>> prepared a git branch on top of those here:
>>
>> git://git.infradead.org/users/hch/vfs.git aio-poll.3
>>
>> Gitweb:
>>
>> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.3
>>
>> Libaio changes:
>>
>> https://pagure.io/libaio.git io-poll
>>
>> Seastar changes (not updated for the new io_pgetevents ABI yet):
>>
>> https://github.com/avikivity/seastar/commits/aio
>>
>> Changes since V2:
>>  - removed a double initialization
>>  - new vfs_get_poll_head helper
>>  - document that ->get_poll_head can return NULL
>>  - call ->poll_mask before sleeping
>>  - various ACKs
>>  - add conversion of random to ->poll_mask
>>  - add conversion of af_alg to ->poll_mask
>>  - lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL
>>  - reshuffled the series so that prep patches and everything not
>>requiring the new in-kernel poll API is in the beginning
>>
>> Changes since V1:
>>  - handle the NULL ->poll case in vfs_poll
>>  - dropped the file argument to the ->poll_mask socket operation
>>  - replace the ->pre_poll socket operation with ->get_poll_head as
>>in the file operations
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-aio' in
>> the body to majord...@kvack.org.  For more info on Linux AIO,
>> see: http://www.kvack.org/aio/
>> Don't email: a...@kvack.org
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majord...@kvack.org.  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: a...@kvack.org

