On 26/01/08 11:31AM, Jonathan Cameron wrote:
> On Wed,  7 Jan 2026 09:33:11 -0600
> John Groves <[email protected]> wrote:
> 
> > The new fsdev driver provides pages/folios initialized compatibly with
> > fsdax - normal rather than devdax-style refcounting, and starting out
> > with order-0 folios.
> > 
> > When fsdev binds to a daxdev, it is usually (always?) switching from the
> > devdax mode (device.c), which pre-initializes compound folios according
> > to its alignment. Fsdev uses fsdev_clear_folio_state() to switch the
> > folios into a fsdax-compatible state.
> > 
> > A side effect of this is that raw mmap doesn't (can't?) work on an fsdev
> > dax instance. Accordingly, The fsdev driver does not provide raw mmap -
> > devices must be put in 'devdax' mode (drivers/dax/device.c) to get raw
> > mmap capability.
> > 
> > In this commit is just the framework, which remaps pages/folios compatibly
> > with fsdax.
> > 
> > Enabling dax changes:
> > 
> > * bus.h: add DAXDRV_FSDEV_TYPE driver type
> > * bus.c: allow DAXDRV_FSDEV_TYPE drivers to bind to daxdevs
> > * dax.h: prototype inode_dax(), which fsdev needs
> > 
> > Suggested-by: Dan Williams <[email protected]>
> > Suggested-by: Gregory Price <[email protected]>
> > Signed-off-by: John Groves <[email protected]>
> 
> > diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> > index d656e4c0eb84..491325d914a8 100644
> > --- a/drivers/dax/Kconfig
> > +++ b/drivers/dax/Kconfig
> > @@ -78,4 +78,21 @@ config DEV_DAX_KMEM
> >  
> >       Say N if unsure.
> >  
> > +config DEV_DAX_FS
> > +   tristate "FSDEV DAX: fs-dax compatible device driver"
> > +   depends on DEV_DAX
> > +   default DEV_DAX
> 
> What's the logic for the default? Generally I'd not expect a
> default for something new like this (so default of default == no)
> 
> > +   help
> > +     Support a device-dax driver mode that is compatible with fs-dax
> 
> ...
> 
> 
> 
> >  struct dax_device_driver {
> > diff --git a/drivers/dax/fsdev.c b/drivers/dax/fsdev.c
> > new file mode 100644
> > index 000000000000..2a3249d1529c
> > --- /dev/null
> > +++ b/drivers/dax/fsdev.c
> > @@ -0,0 +1,276 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/* Copyright(c) 2026 Micron Technology, Inc. */
> > +#include <linux/memremap.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/module.h>
> > +#include <linux/device.h>
> > +#include <linux/cdev.h>
> > +#include <linux/slab.h>
> > +#include <linux/dax.h>
> > +#include <linux/fs.h>
> > +#include <linux/mm.h>
> > +#include "dax-private.h"
> > +#include "bus.h"
> 
> ...
> 
> > +static void fsdev_cdev_del(void *cdev)
> > +{
> > +   cdev_del(cdev);
> > +}
> > +
> > +static void fsdev_kill(void *dev_dax)
> > +{
> > +   kill_dev_dax(dev_dax);
> > +}
> 
> ...
> 
> > +/*
> > + * Clear any stale folio state from pages in the given range.
> > + * This is necessary because device_dax pre-initializes compound folios
> > + * based on vmemmap_shift, and that state may persist after driver unbind.
> 
> What's the argument for not cleaning these out in the unbind path for 
> device_dax?
> I can see that it might be an optimization if some other code path blindly
> overwrites all this state.

I prefer this because it doesn't rely on some other module having done the
right thing. Dax maintainers might have thoughts too though.

> 
> > + * Since fsdev_dax uses MEMORY_DEVICE_FS_DAX without vmemmap_shift, fs-dax
> > + * expects to find clean order-0 folios that it can build into compound
> > + * folios on demand.
> > + *
> > + * At probe time, no filesystem should be mounted yet, so all mappings
> > + * are stale and must be cleared along with compound state.
> > + */
> > +static void fsdev_clear_folio_state(struct dev_dax *dev_dax)
> > +{
> > +   int i;
> 
> It's becoming increasingly common to declare loop variables as
> for (int i = 0; i <...
> 
> and given that saves us a few lines here it seems worth doing.

Done thanks

> 
> > +
> > +   for (i = 0; i < dev_dax->nr_range; i++) {
> > +           struct range *range = &dev_dax->ranges[i].range;
> > +           unsigned long pfn, end_pfn;
> > +
> > +           pfn = PHYS_PFN(range->start);
> > +           end_pfn = PHYS_PFN(range->end) + 1;
> 
> Might as well do
>               unsigned long pfn = PHY_PFN(range->start);
>               unsigned long end_pfn = PHYS_PFN(range->end) + 1;

Sounds good, done

> > +
> > +           while (pfn < end_pfn) {
> > +                   struct page *page = pfn_to_page(pfn);
> > +                   struct folio *folio = (struct folio *)page;
> > +                   struct dev_pagemap *pgmap = page_pgmap(page);
> > +                   int order = folio_order(folio);
> > +
> > +                   /*
> > +                    * Clear any stale mapping pointer. At probe time,
> > +                    * no filesystem is mounted, so any mapping is stale.
> > +                    */
> > +                   folio->mapping = NULL;
> > +                   folio->share = 0;
> > +
> > +                   if (order > 0) {
> > +                           int j;
> > +
> > +                           folio_reset_order(folio);
> > +                           for (j = 0; j < (1UL << order); j++) {
> > +                                   struct page *p = page + j;
> > +
> > +                                   ClearPageHead(p);
> > +                                   clear_compound_head(p);
> > +                                   ((struct folio *)p)->mapping = NULL;
> 
> This code block is very similar to a chunk in dax_folio_put() in fs/dax.c
> 
> Can we create a helper for both to use?
> 
> I note that uses a local struct folio *new_folio to avoid multiple casts.
> I'd do similar here even if it's a long line.
>  
> If not possible to use a common helper, it is probably still worth
> having a helper here for the stuff in the while loop just to reduce indent
> and improve readability a little.

Good catch! You shall have a Suggested-by in the next version, which will
inject a commit right before this that factors out dax_folio_reset_order()
from dax_folio_put(). Then fsdev_clear_folio_state() will also call that.

> 
> > +                                   ((struct folio *)p)->share = 0;
> > +                                   ((struct folio *)p)->pgmap = pgmap;
> > +                           }
> > +                           pfn += (1UL << order);
> > +                   } else {
> > +                           folio->pgmap = pgmap;
> > +                           pfn++;
> > +                   }
> > +           }
> > +   }
> > +}
> > +
> > +static int fsdev_open(struct inode *inode, struct file *filp)
> > +{
> > +   struct dax_device *dax_dev = inode_dax(inode);
> > +   struct dev_dax *dev_dax = dax_get_private(dax_dev);
> > +
> > +   dev_dbg(&dev_dax->dev, "trace\n");
> 
> Hmm. This is a somewhat odd, but I see dax/device.c does
> the same thing and I guess that's because you are using
> dynamic debug with function names turned on to provide the
> 'real' information.

Actually I just have it from the gut-and-repurpose of device.c.
Dropping from fsdev.c as I'm not using it.

> 
> 
> 
> > +   filp->private_data = dev_dax;
> > +
> > +   return 0;
> > +}
> 
> > +static int fsdev_dax_probe(struct dev_dax *dev_dax)
> > +{
> > +   struct dax_device *dax_dev = dev_dax->dax_dev;
> > +   struct device *dev = &dev_dax->dev;
> > +   struct dev_pagemap *pgmap;
> > +   u64 data_offset = 0;
> > +   struct inode *inode;
> > +   struct cdev *cdev;
> > +   void *addr;
> > +   int rc, i;
> > +
> 
> A bunch of this is cut and paste from dax/device.c
> If it carries on looking like this, can we have a helper module that
> both drivers use with the common code in it? That would make the
> difference more obvious as well.

Makes sense. I'll wait for thoughts from the dax people before
flipping bits on this though.

> 
> > +   if (static_dev_dax(dev_dax))  {
> > +           if (dev_dax->nr_range > 1) {
> > +                   dev_warn(dev,
> > +                           "static pgmap / multi-range device conflict\n");
> > +                   return -EINVAL;
> > +           }
> > +
> > +           pgmap = dev_dax->pgmap;
> > +   } else {
> > +           if (dev_dax->pgmap) {
> > +                   dev_warn(dev,
> > +                            "dynamic-dax with pre-populated page map\n");
> Unless dax maintainers are very fussy about 80 chars, I'd go long on these as 
> it's
> only just over 80 chars on one line.
> 
> Given you are failing probe, not sure why dev_warn() is considered sufficient.
> To me dev_err() seems more sensible. What you have matches dax/device.c though
> so maybe there is a sound reason.

I'm personally a bit fussy about 80 column code, being kinda old and favoring
80 column emacs windows :D - mulling it over.

> 
> > +                   return -EINVAL;
> > +           }
> > +
> > +           pgmap = devm_kzalloc(dev,
> > +                   struct_size(pgmap, ranges, dev_dax->nr_range - 1),
> > +                                GFP_KERNEL);
> Pick an alignment style and stick to it.  Either.
>               pgmap = devm_kzalloc(dev,
>                       struct_size(pgmap, ranges, dev_dax->nr_range - 1),
>                       GFP_KERNEL);
> 
> or go long for readability and do
>               pgmap = devm_kzalloc(dev,
>                                    struct_size(pgmap, ranges, 
> dev_dax->nr_range - 1),
>                                    GFP_KERNEL);

Will do something cleaner. This is the aforementioned 80 column curmudgeonliness
at work...

> 
> 
> 
> > +           if (!pgmap)
> > +                   return -ENOMEM;
> > +
> > +           pgmap->nr_range = dev_dax->nr_range;
> > +           dev_dax->pgmap = pgmap;
> > +
> > +           for (i = 0; i < dev_dax->nr_range; i++) {
> > +                   struct range *range = &dev_dax->ranges[i].range;
> > +
> > +                   pgmap->ranges[i] = *range;
> > +           }
> > +   }
> > +
> > +   for (i = 0; i < dev_dax->nr_range; i++) {
> > +           struct range *range = &dev_dax->ranges[i].range;
> > +
> > +           if (!devm_request_mem_region(dev, range->start,
> > +                                   range_len(range), dev_name(dev))) {
> > +                   dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve 
> > range\n",
> > +                                   i, range->start, range->end);
> > +                   return -EBUSY;
> > +           }
> > +   }
> > +
> > +   /*
> > +    * FS-DAX compatible mode: Use MEMORY_DEVICE_FS_DAX type and
> > +    * do NOT set vmemmap_shift. This leaves folios at order-0,
> > +    * allowing fs-dax to dynamically create compound folios as needed
> > +    * (similar to pmem behavior).
> > +    */
> > +   pgmap->type = MEMORY_DEVICE_FS_DAX;
> > +   pgmap->ops = &fsdev_pagemap_ops;
> > +   pgmap->owner = dev_dax;
> > +
> > +   /*
> > +    * CRITICAL DIFFERENCE from device.c:
> > +    * We do NOT set vmemmap_shift here, even if align > PAGE_SIZE.
> > +    * This ensures folios remain order-0 and are compatible with
> > +    * fs-dax's folio management.
> > +    */
> > +
> > +   addr = devm_memremap_pages(dev, pgmap);
> > +   if (IS_ERR(addr))
> > +           return PTR_ERR(addr);
> > +
> > +   /*
> > +    * Clear any stale compound folio state left over from a previous
> > +    * driver (e.g., device_dax with vmemmap_shift).
> > +    */
> > +   fsdev_clear_folio_state(dev_dax);
> > +
> > +   /* Detect whether the data is at a non-zero offset into the memory */
> > +   if (pgmap->range.start != dev_dax->ranges[0].range.start) {
> > +           u64 phys = dev_dax->ranges[0].range.start;
> > +           u64 pgmap_phys = dev_dax->pgmap[0].range.start;
> > +
> > +           if (!WARN_ON(pgmap_phys > phys))
> > +                   data_offset = phys - pgmap_phys;
> > +
> > +           pr_debug("%s: offset detected phys=%llx pgmap_phys=%llx 
> > offset=%llx\n",
> > +                  __func__, phys, pgmap_phys, data_offset);
> > +   }
> > +
> > +   inode = dax_inode(dax_dev);
> > +   cdev = inode->i_cdev;
> > +   cdev_init(cdev, &fsdev_fops);
> > +   cdev->owner = dev->driver->owner;
> > +   cdev_set_parent(cdev, &dev->kobj);
> > +   rc = cdev_add(cdev, dev->devt, 1);
> > +   if (rc)
> > +           return rc;
> > +
> > +   rc = devm_add_action_or_reset(dev, fsdev_cdev_del, cdev);
> > +   if (rc)
> > +           return rc;
> > +
> > +   run_dax(dax_dev);
> > +   return devm_add_action_or_reset(dev, fsdev_kill, dev_dax);
> > +}
> > +
> > +static struct dax_device_driver fsdev_dax_driver = {
> > +   .probe = fsdev_dax_probe,
> > +   .type = DAXDRV_FSDEV_TYPE,
> > +};
> > +
> > +static int __init dax_init(void)
> > +{
> > +   return dax_driver_register(&fsdev_dax_driver);
> > +}
> > +
> > +static void __exit dax_exit(void)
> > +{
> > +   dax_driver_unregister(&fsdev_dax_driver);
> > +}
> If these don't get more complex, maybe it's time for a dax specific define
> using module_driver()

I'll defer to the dax folks here

> 
> > +
> > +MODULE_AUTHOR("John Groves");
> > +MODULE_DESCRIPTION("FS-DAX Device: fs-dax compatible devdax driver");
> > +MODULE_LICENSE("GPL");
> > +module_init(dax_init);
> > +module_exit(dax_exit);
> > +MODULE_ALIAS_DAX_DEVICE(0);
> 
> Curious macro. Always has same parameter...  Maybe ripe for just dropping the 
> parameter?
> 
> 

Thanks!
John

Reply via email to