Re: [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()
On Thu, Aug 13, 2015 at 9:34 AM, Boaz Harrosh wrote: > On 08/13/2015 06:21 PM, Dan Williams wrote: >> On Wed, Aug 12, 2015 at 11:26 PM, Boaz Harrosh wrote: > <> >> >> Hmm, that's not the same block layer I've been working with for the >> past several years: >> >> $ mount /dev/pmem0 /mnt >> $ echo namespace0.0 > ../drivers/nd_pmem/unbind # succeeds >> >> Unbind always proceeds unconditionally. See the recent kernel summit >> topic discussion around devm vs unbind [1]. While kmap_atomic_pfn_t() >> does not implement revoke semantics it at least forces re-validation >> and time bounded references. For the unplug case we'll need to go >> shootdown those DAX mappings in userspace so that they return SIGBUS >> on access, or something along those lines. >> > > Then fix unbind to refuse. What is the point of unbind when it trashes > the hot path so badly and makes the code so fat. What? The DAX hot path avoids the kernel entirely. > Who uses it and what for? The device driver core. We simply can't hold off remove indefinitely. If the administrator wants the device disabled we need to tear down and revoke active mappings, or at very least guarantee time bounded removal. > First I ever heard of it and I do use Linux a little bit. > >> [1]: http://www.spinics.net/lists/kernel/msg2032864.html >> > Hm... > > OK I hate it. I would just make sure to override and refuse unbinding with an > elevated ref count. Is not a good reason for me to trash the hotpath. Again, the current usages are not in hot paths. If it becomes part of a hot path *and* shows up in a profile we can look to implement something with less overhead. Until then we should plan to honor the lifetime as defined by ->probe() and ->remove(). In fact I proposed the same as you, but then changed my mind based on Tejun's response [1]. So please reconsider this idea to solve the problem by blocking ->remove(). PMEM is new and special, but not *that* special as to justify breaking basic guarantees. [1]: https://lkml.org/lkml/2015/7/15/731 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()
On 08/13/2015 06:21 PM, Dan Williams wrote: > On Wed, Aug 12, 2015 at 11:26 PM, Boaz Harrosh wrote: <> > > Hmm, that's not the same block layer I've been working with for the > past several years: > > $ mount /dev/pmem0 /mnt > $ echo namespace0.0 > ../drivers/nd_pmem/unbind # succeeds > > Unbind always proceeds unconditionally. See the recent kernel summit > topic discussion around devm vs unbind [1]. While kmap_atomic_pfn_t() > does not implement revoke semantics it at least forces re-validation > and time bounded references. For the unplug case we'll need to go > shootdown those DAX mappings in userspace so that they return SIGBUS > on access, or something along those lines. > Then fix unbind to refuse. What is the point of unbind when it trashes the hot path so badly and makes the code so fat. Who uses it and what for? First I ever heard of it and I do use Linux a little bit. > [1]: http://www.spinics.net/lists/kernel/msg2032864.html > Hm... OK I hate it. I would just make sure to override and refuse unbinding with an elevated ref count. Is not a good reason for me to trash the hotpath. >> And for god sake. I have a bdev I call bdev_direct_access(sector), the bdev >> calculated the >> exact address for me (base + sector). Now I get back this __pfn_t and I need >> to call >> kmap_atomic_pfn_t() which does a loop to search for my range and again >> base+offset ? >> >> This all model is broken, sorry? > > I think you are confused about the lifetime of the userspace DAX > mapping vs the kernel's mapping and the frequency of calls to > kmap_atomic_pfn_t(). I'm sure you can make this loop look bad with a > micro-benchmark, but the whole point of DAX is to get the kernel out > of the I/O path, so I'm not sure this overhead shows up in any real > way in practice. Sigh! It does. very much. 4k random write for you. Will drop in half if I do this. We've been testing with memory for a long time every rcu lock counts. A single atomic will drop things by %20 Thanks Boaz -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()
On Wed, Aug 12, 2015 at 11:26 PM, Boaz Harrosh wrote: > B. Here this all set is a joke. The all "pmem disable vs still-in-use" > argument is mute > here below you have inserted a live, used for ever, pfn into a process vm > without holding > a map. Careful, don't confuse "unbind" with "unplug". "Unbind" invalidates the driver's mapping (ioremap) while "unplug" would invalidate the pfn. DAX is indeed broken with respect to unplug and we'll need to go solve that separately. I expect "unplug" support will be needed for hot provisioning pmem to/from virtual machines. > The all "pmem disable vs still-in-use" is a joke. The FS loaded has a > reference on the bdev > and the filehadle has a reference on the FS. So what is exactly this "pmem > disable" you are > talking about? Hmm, that's not the same block layer I've been working with for the past several years: $ mount /dev/pmem0 /mnt $ echo namespace0.0 > ../drivers/nd_pmem/unbind # succeeds Unbind always proceeds unconditionally. See the recent kernel summit topic discussion around devm vs unbind [1]. While kmap_atomic_pfn_t() does not implement revoke semantics it at least forces re-validation and time bounded references. For the unplug case we'll need to go shootdown those DAX mappings in userspace so that they return SIGBUS on access, or something along those lines. [1]: http://www.spinics.net/lists/kernel/msg2032864.html > And for god sake. I have a bdev I call bdev_direct_access(sector), the bdev > calculated the > exact address for me (base + sector). Now I get back this __pfn_t and I need > to call > kmap_atomic_pfn_t() which does a loop to search for my range and again > base+offset ? > > This all model is broken, sorry? I think you are confused about the lifetime of the userspace DAX mapping vs the kernel's mapping and the frequency of calls to kmap_atomic_pfn_t(). I'm sure you can make this loop look bad with a micro-benchmark, but the whole point of DAX is to get the kernel out of the I/O path, so I'm not sure this overhead shows up in any real way in practice. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 4/5] dax: fix mapping lifetime handling, convert to __pfn_t + kmap_atomic_pfn_t()
On 08/13/2015 06:01 AM, Dan Williams wrote: > The primary source for non-page-backed page-frames to enter the system > is via the pmem driver's ->direct_access() method. The pfns returned by > the top-level bdev_direct_access() may be passed to any other subsystem > in the kernel and those sub-systems either need to assume that the pfn > is page backed (CONFIG_DEV_PFN=n) or be prepared to handle non-page > backed case (CONFIG_DEV_PFN=y). Currently the pfns returned by > ->direct_access() are only ever used by vm_insert_mixed() which does not > care if the pfn is mapped. As we go to add more usages of these pfns > add the type-safety of __pfn_t. > > This also simplifies the calling convention of ->direct_access() by not > returning the virtual address in the same call. This annotates cases > where the kernel is directly accessing pmem outside the driver, and > makes the valid lifetime of the reference explicit. This property may > be useful in the future for invalidating mappings to pmem, but for now > it provides some protection against the "pmem disable vs still-in-use" > race. > > Note that axon_ram_direct_access and dcssblk_direct_access were > previously making potentially incorrect assumptions about the addresses > they passed to virt_to_phys(). > > [hch: various minor updates] > Signed-off-by: Christoph Hellwig > Signed-off-by: Dan Williams > --- > arch/powerpc/platforms/Kconfig |1 + > arch/powerpc/sysdev/axonram.c | 24 > drivers/block/brd.c|5 +-- > drivers/nvdimm/Kconfig |1 + > drivers/nvdimm/pmem.c | 24 +++- > drivers/s390/block/Kconfig |1 + > drivers/s390/block/dcssblk.c | 23 ++-- > fs/Kconfig |1 + > fs/block_dev.c |4 +- > fs/dax.c | 79 > +--- > include/linux/blkdev.h |7 ++-- > include/linux/mm.h | 12 ++ > 12 files changed, 129 insertions(+), 53 deletions(-) > > diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig > index b7f9c408bf24..6b1c2f2e5fb4 100644 > --- a/arch/powerpc/platforms/Kconfig > +++ b/arch/powerpc/platforms/Kconfig > @@ -307,6 +307,7 @@ config CPM2 > config AXON_RAM > tristate "Axon DDR2 memory device driver" > depends on PPC_IBM_CELL_BLADE && BLOCK > + select KMAP_PFN > default m > help > It registers one block device per Axon's DDR2 memory bank found > diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c > index e8657d3bc588..7c5a1563c0fd 100644 > --- a/arch/powerpc/sysdev/axonram.c > +++ b/arch/powerpc/sysdev/axonram.c > @@ -43,6 +43,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -141,14 +142,12 @@ axon_ram_make_request(struct request_queue *queue, > struct bio *bio) > */ > static long > axon_ram_direct_access(struct block_device *device, sector_t sector, > -void **kaddr, unsigned long *pfn) > +__pfn_t *pfn) > { > struct axon_ram_bank *bank = device->bd_disk->private_data; > loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT; > > - *kaddr = (void *)(bank->ph_addr + offset); > - *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT; > - > + *pfn = phys_to_pfn_t(bank->ph_addr + offset, PFN_DEV); > return bank->size - offset; > } > > @@ -165,9 +164,13 @@ static int axon_ram_probe(struct platform_device *device) > { > static int axon_ram_bank_id = -1; > struct axon_ram_bank *bank; > - struct resource resource; > + struct resource *resource; > int rc = 0; > > + resource = devm_kzalloc(&device->dev, sizeof(*resource), GFP_KERNEL); > + if (!resource) > + return -ENOMEM; > + > axon_ram_bank_id++; > > dev_info(&device->dev, "Found memory controller on %s\n", > @@ -184,13 +187,13 @@ static int axon_ram_probe(struct platform_device > *device) > > bank->device = device; > > - if (of_address_to_resource(device->dev.of_node, 0, &resource) != 0) { > + if (of_address_to_resource(device->dev.of_node, 0, resource) != 0) { > dev_err(&device->dev, "Cannot access device tree\n"); > rc = -EFAULT; > goto failed; > } > > - bank->size = resource_size(&resource); > + bank->size = resource_size(resource); > > if (bank->size == 0) { > dev_err(&device->dev, "No DDR2 memory found for %s%d\n", > @@ -202,7 +205,7 @@ static int axon_ram_probe(struct platform_device *device) > dev_info(&device->dev, "Register DDR2 memory device %s%d with %luMB\n", > AXON_RAM_DEVICE_NAME, axon_ram_bank_id, bank->size >> > 20); > > - bank->ph_addr = resource.start; > + bank->ph_addr = resource->start; > bank->io_addr = (unsigned long) ioremap_prot( > bank->ph_addr,