On 26/01/08 11:31AM, Jonathan Cameron wrote:
> On Wed, 7 Jan 2026 09:33:11 -0600
> John Groves <[email protected]> wrote:
>
> > The new fsdev driver provides pages/folios initialized compatibly with
> > fsdax - normal rather than devdax-style refcounting, and starting out
> > with order-0 folios.
> >
> > When fsdev binds to a daxdev, it is usually (always?) switching from the
> > devdax mode (device.c), which pre-initializes compound folios according
> > to its alignment. Fsdev uses fsdev_clear_folio_state() to switch the
> > folios into a fsdax-compatible state.
> >
> > A side effect of this is that raw mmap doesn't (can't?) work on an fsdev
> > dax instance. Accordingly, The fsdev driver does not provide raw mmap -
> > devices must be put in 'devdax' mode (drivers/dax/device.c) to get raw
> > mmap capability.
> >
> > In this commit is just the framework, which remaps pages/folios compatibly
> > with fsdax.
> >
> > Enabling dax changes:
> >
> > * bus.h: add DAXDRV_FSDEV_TYPE driver type
> > * bus.c: allow DAXDRV_FSDEV_TYPE drivers to bind to daxdevs
> > * dax.h: prototype inode_dax(), which fsdev needs
> >
> > Suggested-by: Dan Williams <[email protected]>
> > Suggested-by: Gregory Price <[email protected]>
> > Signed-off-by: John Groves <[email protected]>
>
> > diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> > index d656e4c0eb84..491325d914a8 100644
> > --- a/drivers/dax/Kconfig
> > +++ b/drivers/dax/Kconfig
> > @@ -78,4 +78,21 @@ config DEV_DAX_KMEM
> >
> > Say N if unsure.
> >
> > +config DEV_DAX_FS
> > + tristate "FSDEV DAX: fs-dax compatible device driver"
> > + depends on DEV_DAX
> > + default DEV_DAX
>
> What's the logic for the default? Generally I'd not expect a
> default for something new like this (so default of default == no)
>
> > + help
> > + Support a device-dax driver mode that is compatible with fs-dax
>
> ...
>
>
>
> > struct dax_device_driver {
> > diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> > new file mode 100644
> > index 000000000000..2a3249d1529c
> > --- /dev/null
> > +++ b/drivers/dax/fsdev.c
> > @@ -0,0 +1,276 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright(c) 2026 Micron Technology, Inc. */
> > +#include <linux/memremap.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/module.h>
> > +#include <linux/device.h>
> > +#include <linux/cdev.h>
> > +#include <linux/slab.h>
> > +#include <linux/dax.h>
> > +#include <linux/fs.h>
> > +#include <linux/mm.h>
> > +#include "dax-private.h"
> > +#include "bus.h"
>
> ...
>
> > +static void fsdev_cdev_del(void *cdev)
> > +{
> > + cdev_del(cdev);
> > +}
> > +
> > +static void fsdev_kill(void *dev_dax)
> > +{
> > + kill_dev_dax(dev_dax);
> > +}
>
> ...
>
> > +/*
> > + * Clear any stale folio state from pages in the given range.
> > + * This is necessary because device_dax pre-initializes compound folios
> > + * based on vmemmap_shift, and that state may persist after driver unbind.
>
> What's the argument for not cleaning these out in the unbind path for
> device_dax?
> I can see that it might be an optimization if some other code path blindly
> overwrites all this state.
I prefer this because it doesn't rely on some other module having done the
right thing. Dax maintainers might have thoughts too though.
>
> > + * Since fsdev_dax uses MEMORY_DEVICE_FS_DAX without vmemmap_shift, fs-dax
> > + * expects to find clean order-0 folios that it can build into compound
> > + * folios on demand.
> > + *
> > + * At probe time, no filesystem should be mounted yet, so all mappings
> > + * are stale and must be cleared along with compound state.
> > + */
> > +static void fsdev_clear_folio_state(struct dev_dax *dev_dax)
> > +{
> > + int i;
>
> It's becoming increasingly common to declare loop variables as
> for (int i = 0; i <...
>
> and given that saves us a few lines here it seems worth doing.
Done thanks
>
> > +
> > + for (i = 0; i < dev_dax->nr_range; i++) {
> > + struct range *range = &dev_dax->ranges[i].range;
> > + unsigned long pfn, end_pfn;
> > +
> > + pfn = PHYS_PFN(range->start);
> > + end_pfn = PHYS_PFN(range->end) + 1;
>
> Might as well do
> unsigned long pfn = PHY_PFN(range->start);
> unsigned long end_pfn = PHYS_PFN(range->end) + 1;
Sounds good, done
> > +
> > + while (pfn < end_pfn) {
> > + struct page *page = pfn_to_page(pfn);
> > + struct folio *folio = (struct folio *)page;
> > + struct dev_pagemap *pgmap = page_pgmap(page);
> > + int order = folio_order(folio);
> > +
> > + /*
> > + * Clear any stale mapping pointer. At probe time,
> > + * no filesystem is mounted, so any mapping is stale.
> > + */
> > + folio->mapping = NULL;
> > + folio->share = 0;
> > +
> > + if (order > 0) {
> > + int j;
> > +
> > + folio_reset_order(folio);
> > + for (j = 0; j < (1UL << order); j++) {
> > + struct page *p = page + j;
> > +
> > + ClearPageHead(p);
> > + clear_compound_head(p);
> > + ((struct folio *)p)->mapping = NULL;
>
> This code block is very similar to a chunk in dax_folio_put() in fs/dax.c
>
> Can we create a helper for both to use?
>
> I note that uses a local struct folio *new_folio to avoid multiple casts.
> I'd do similar here even if it's a long line.
>
> If not possible to use a common helper, it is probably still worth
> having a helper here for the stuff in the while loop just to reduce indent
> and improve readability a little.
Good catch! You shall have a Suggested-by in the next version, which will
inject a commit right before this that factors out dax_folio_reset_order()
from dax_folio_put(). Then fsdev_clear_folio_state() will also call that.
>
> > + ((struct folio *)p)->share = 0;
> > + ((struct folio *)p)->pgmap = pgmap;
> > + }
> > + pfn += (1UL << order);
> > + } else {
> > + folio->pgmap = pgmap;
> > + pfn++;
> > + }
> > + }
> > + }
> > +}
> > +
> > +static int fsdev_open(struct inode *inode, struct file *filp)
> > +{
> > + struct dax_device *dax_dev = inode_dax(inode);
> > + struct dev_dax *dev_dax = dax_get_private(dax_dev);
> > +
> > + dev_dbg(&dev_dax->dev, "trace\n");
>
> Hmm. This is a somewhat odd, but I see dax/device.c does
> the same thing and I guess that's because you are using
> dynamic debug with function names turned on to provide the
> 'real' information.
Actually I just have it from the gut-and-repurpose of device.c.
Dropping from fsdev.c as I'm not using it.
>
>
>
> > + filp->private_data = dev_dax;
> > +
> > + return 0;
> > +}
>
> > +static int fsdev_dax_probe(struct dev_dax *dev_dax)
> > +{
> > + struct dax_device *dax_dev = dev_dax->dax_dev;
> > + struct device *dev = &dev_dax->dev;
> > + struct dev_pagemap *pgmap;
> > + u64 data_offset = 0;
> > + struct inode *inode;
> > + struct cdev *cdev;
> > + void *addr;
> > + int rc, i;
> > +
>
> A bunch of this is cut and paste from dax/device.c
> If it carries on looking like this, can we have a helper module that
> both drivers use with the common code in it? That would make the
> difference more obvious as well.
Makes sense. I'll wait for thoughts from the dax people before
flipping bits on this though.
>
> > + if (static_dev_dax(dev_dax)) {
> > + if (dev_dax->nr_range > 1) {
> > + dev_warn(dev,
> > + "static pgmap / multi-range device conflict\n");
> > + return -EINVAL;
> > + }
> > +
> > + pgmap = dev_dax->pgmap;
> > + } else {
> > + if (dev_dax->pgmap) {
> > + dev_warn(dev,
> > + "dynamic-dax with pre-populated page map\n");
> Unless dax maintainers are very fussy about 80 chars, I'd go long on these as
> it's
> only just over 80 chars on one line.
>
> Given you are failing probe, not sure why dev_warn() is considered sufficient.
> To me dev_err() seems more sensible. What you have matches dax/device.c though
> so maybe there is a sound reason.
I'm personally a bit fussy about 80 column code, being kinda old and favoring
80 column emacs windows :D - mulling it over.
>
> > + return -EINVAL;
> > + }
> > +
> > + pgmap = devm_kzalloc(dev,
> > + struct_size(pgmap, ranges, dev_dax->nr_range - 1),
> > + GFP_KERNEL);
> Pick an alignment style and stick to it. Either.
> pgmap = devm_kzalloc(dev,
> struct_size(pgmap, ranges, dev_dax->nr_range - 1),
> GFP_KERNEL);
>
> or go long for readability and do
> pgmap = devm_kzalloc(dev,
> struct_size(pgmap, ranges,
> dev_dax->nr_range - 1),
> GFP_KERNEL);
Will do something cleaner. This is the aforementioned 80 column curmudgeonliness
at work...
>
>
>
> > + if (!pgmap)
> > + return -ENOMEM;
> > +
> > + pgmap->nr_range = dev_dax->nr_range;
> > + dev_dax->pgmap = pgmap;
> > +
> > + for (i = 0; i < dev_dax->nr_range; i++) {
> > + struct range *range = &dev_dax->ranges[i].range;
> > +
> > + pgmap->ranges[i] = *range;
> > + }
> > + }
> > +
> > + for (i = 0; i < dev_dax->nr_range; i++) {
> > + struct range *range = &dev_dax->ranges[i].range;
> > +
> > + if (!devm_request_mem_region(dev, range->start,
> > + range_len(range), dev_name(dev))) {
> > + dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve
> > range\n",
> > + i, range->start, range->end);
> > + return -EBUSY;
> > + }
> > + }
> > +
> > + /*
> > + * FS-DAX compatible mode: Use MEMORY_DEVICE_FS_DAX type and
> > + * do NOT set vmemmap_shift. This leaves folios at order-0,
> > + * allowing fs-dax to dynamically create compound folios as needed
> > + * (similar to pmem behavior).
> > + */
> > + pgmap->type = MEMORY_DEVICE_FS_DAX;
> > + pgmap->ops = &fsdev_pagemap_ops;
> > + pgmap->owner = dev_dax;
> > +
> > + /*
> > + * CRITICAL DIFFERENCE from device.c:
> > + * We do NOT set vmemmap_shift here, even if align > PAGE_SIZE.
> > + * This ensures folios remain order-0 and are compatible with
> > + * fs-dax's folio management.
> > + */
> > +
> > + addr = devm_memremap_pages(dev, pgmap);
> > + if (IS_ERR(addr))
> > + return PTR_ERR(addr);
> > +
> > + /*
> > + * Clear any stale compound folio state left over from a previous
> > + * driver (e.g., device_dax with vmemmap_shift).
> > + */
> > + fsdev_clear_folio_state(dev_dax);
> > +
> > + /* Detect whether the data is at a non-zero offset into the memory */
> > + if (pgmap->range.start != dev_dax->ranges[0].range.start) {
> > + u64 phys = dev_dax->ranges[0].range.start;
> > + u64 pgmap_phys = dev_dax->pgmap[0].range.start;
> > +
> > + if (!WARN_ON(pgmap_phys > phys))
> > + data_offset = phys - pgmap_phys;
> > +
> > + pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx
> > offset=%llx\n",
> > + __func__, phys, pgmap_phys, data_offset);
> > + }
> > +
> > + inode = dax_inode(dax_dev);
> > + cdev = inode->i_cdev;
> > + cdev_init(cdev, &fsdev_fops);
> > + cdev->owner = dev->driver->owner;
> > + cdev_set_parent(cdev, &dev->kobj);
> > + rc = cdev_add(cdev, dev->devt, 1);
> > + if (rc)
> > + return rc;
> > +
> > + rc = devm_add_action_or_reset(dev, fsdev_cdev_del, cdev);
> > + if (rc)
> > + return rc;
> > +
> > + run_dax(dax_dev);
> > + return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
> > +}
> > +
> > +static struct dax_device_driver fsdev_dax_driver = {
> > + .probe = fsdev_dax_probe,
> > + .type = DAXDRV_FSDEV_TYPE,
> > +};
> > +
> > +static int __init dax_init(void)
> > +{
> > + return dax_driver_register(&fsdev_dax_driver);
> > +}
> > +
> > +static void __exit dax_exit(void)
> > +{
> > + dax_driver_unregister(&fsdev_dax_driver);
> > +}
> If these don't get more complex, maybe it's time for a dax specific define
> using module_driver()
I'll defer to the dax folks here
>
> > +
> > +MODULE_AUTHOR("John Groves");
> > +MODULE_DESCRIPTION("FS-DAX Device: fs-dax compatible devdax driver");
> > +MODULE_LICENSE("GPL");
> > +module_init(dax_init);
> > +module_exit(dax_exit);
> > +MODULE_ALIAS_DAX_DEVICE(0);
>
> Curious macro. Always has same parameter... Maybe ripe for just dropping the
> parameter?
>
>
Thanks!
John