Hi John,

I have started doing reviews based on design documents for new features.
In general it is rather hard for humans to reverse-engineer the design
from a patch series. With AI that has become easier, but the result
should still be verified by the author. Could you check whether the
attached AI-generated document is correct?


Thanks,
Bernd
================================================================================
famfs (FUSE-based fabric-attached memory file system) - Design Document
================================================================================

Audience and scope
==================
This document is written for people already familiar with FUSE (lowlevel ops,
opcodes, INIT capability negotiation) but NOT necessarily with Linux DAX,
devdax, or the kernel's iomap framework. Section 2 is a primer on those.

It covers two trees:

  Kernel:  /home/bernd/src/linux/linux.git , branch `famfs`,
           commits 4a8ae428c392 .. HEAD (da9edf77cbc4)

  libfuse: /home/bernd/src/libfuse/libfuse.git , branch `famfs`,
           commits d75ae2ee .. HEAD (9c65d781)

Kernel files added or changed:

    fs/fuse/famfs.c              - new, all famfs kernel logic
    fs/fuse/famfs_kfmap.h        - new, in-memory fmap structures
    fs/fuse/fuse_i.h             - per-inode/per-conn famfs additions, helpers
    fs/fuse/file.c               - r/w/mmap dispatch into famfs paths
    fs/fuse/inode.c              - INIT-flag negotiation, conn teardown wiring
    fs/fuse/iomode.c             - bypass io-modes for famfs files
    fs/fuse/Kconfig, Makefile    - new CONFIG_FUSE_FAMFS_DAX
    include/uapi/linux/fuse.h    - new opcodes, structs, FUSE_DAX_FMAP flag
    fs/namei.c                   - export may_open_dev()
    Documentation/filesystems/famfs.rst - user/admin documentation

libfuse files added or changed:

    include/fuse_kernel.h        - mirror of kernel uapi at protocol 7.46
    include/fuse_common.h        - new FUSE_CAP_DAX_FMAP capability bit
    include/fuse_lowlevel.h      - new ops: get_fmap(), get_daxdev()
    lib/fuse_lowlevel.c          - INIT negotiation + opcode dispatch
                                   (do_get_fmap, do_get_daxdev)


--------------------------------------------------------------------------------
1. Background and goals
--------------------------------------------------------------------------------

Famfs exposes shared, fabric-attached memory (CXL devdax) as a regular
filesystem. The fast path (read/write/mmap-fault) must reach memory without a
round trip to the FUSE server: the server only delivers metadata.

Two key observations shape the design:

    * Files are NEVER allocated in the kernel. Userspace pre-allocates extents
      and gives the kernel an "fmap" (file-to-dax-offset map).
    * There is NO writeback. Backing memory is the storage; CPU caches are
      loaded directly from the dax memory.

Consequences in the kernel:

    * No page cache is used. `noop_dirty_folio` is the only address_space op.
    * The kernel never grows or shrinks files. Any size change (including
      truncate) puts the file into an "error" state.
    * Reads/writes/mmap dispatch through `dax_iomap_*()` and the famfs
      `iomap_ops`, exactly the way fs-dax filesystems (xfs/ext4) plumb them.

Comparison to other FUSE modes that you may know:

    classic FUSE   - every read/write/mmap is forwarded to the server.
    virtio-fs DAX  - the server donates a window of host memory; kernel maps
                     file ranges into that window via FUSE_SETUPMAPPING /
                     FUSE_REMOVEMAPPING. The server is still the "owner" of
                     the backing memory.
    famfs (this)   - the server hands the kernel a description of where each
                     file's bytes live on a real character device (devdax).
                     After that, the server is OUT of the data path entirely.


--------------------------------------------------------------------------------
2. Primer: devdax, DAX and iomap (only what's needed below)
--------------------------------------------------------------------------------

You can skip this section if "iomap_begin", "dax_iomap_rw" and "devdax holder"
already mean something to you.

devdax
    A character device (`/dev/daxN.M`) that exposes a contiguous range of
    physical memory directly to userspace via mmap. There is no page cache
    and no block device underneath; reads and writes hit RAM/CXL memory
    directly. Famfs uses devdax devices as its "disks".

DAX (Direct Access)
    A kernel pathway that lets a filesystem map file pages straight onto
    the underlying memory pages (PFNs) without going through the page cache.
    A file/inode tagged with `S_DAX` opts in. Reads turn into memcpy from
    the memory; mmap faults install the memory's PFN directly into the page
    table (PTE/PMD/PUD).

iomap
    A filesystem-agnostic mechanism that says "to do this read/write/fault
    on this file at this offset and length, here is exactly which device,
    which device-relative offset, and how many bytes are valid here."
    Filesystems implement `struct iomap_ops`, of which the central callback
    is:

        .iomap_begin(inode, file_offset, length, flags,
                     struct iomap *out, struct iomap *srcmap)

    The filesystem fills `out` with:
        out->dax_dev  - which DAX device backs this range
        out->addr     - byte offset within that DAX device
        out->offset   - file offset (echoed back)
        out->length   - how many contiguous bytes are valid here
        out->type     - IOMAP_MAPPED (famfs only ever returns this)

    The DAX core then loops, calling `iomap_begin` repeatedly to walk the
    requested range and, for each chunk, doing either:
        - memcpy to/from `dax_dev + addr`        (read/write)
        - or installing the PFN at `dax_dev + addr` into a page table (faults)

    Entry points the famfs code uses:
        dax_iomap_rw(iocb, iter, ops)       - read/write
        dax_iomap_fault(vmf, order, ..., ops) - mmap PTE/PMD/PUD fault
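
    The loop described above can be illustrated with a small userspace
    sketch. It is not kernel code: `toy_iomap`, `toy_dax_read` and
    `toy_begin` are hypothetical names, and a plain byte array stands in for
    the DAX device's memory. It only shows the "ask for a mapping, copy that
    chunk, advance, repeat" pattern.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the kernel's struct iomap. */
struct toy_iomap {
    uint64_t addr;    /* byte offset within the backing memory */
    uint64_t offset;  /* file offset this mapping starts at */
    uint64_t length;  /* contiguous valid bytes starting at offset */
};

/* Callback modelled on .iomap_begin: returns 0 or a negative errno. */
typedef int (*toy_iomap_begin_t)(uint64_t pos, uint64_t len,
                                 struct toy_iomap *out);

/*
 * Toy read loop: request a mapping for the current position, copy that
 * chunk out of `mem`, advance, and repeat until `len` bytes are copied.
 * Returns bytes copied, or a negative errno if nothing was copied.
 */
int64_t toy_dax_read(const uint8_t *mem, toy_iomap_begin_t begin,
                     uint64_t pos, uint8_t *buf, uint64_t len)
{
    uint64_t done = 0;

    while (done < len) {
        struct toy_iomap m;
        int err = begin(pos + done, len - done, &m);

        if (err)
            return done ? (int64_t)done : err;
        if (m.length == 0)
            break;
        memcpy(buf + done, mem + m.addr, m.length);
        done += m.length;
    }
    return (int64_t)done;
}

/*
 * Example mapping for an 8-byte file made of two 4-byte extents:
 * file bytes 0..3 live at mem[8..11], file bytes 4..7 at mem[0..3].
 */
int toy_begin(uint64_t pos, uint64_t len, struct toy_iomap *out)
{
    uint64_t local = pos % 4;
    uint64_t remain = 4 - local;

    if (pos >= 8)
        return -5;                     /* -EIO: past the last extent */
    out->addr = (pos < 4 ? 8 : 0) + local;
    out->offset = pos;
    out->length = len < remain ? len : remain;
    return 0;
}
```

    A read that crosses the extent boundary takes two iterations of the
    loop, one per extent, which is the behaviour the famfs resolver relies
    on when it caps `length` at the end of an extent.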

dax holder
    DAX devices have a single "holder" - a struct (here `struct fuse_conn *`)
    that owns the device. Acquired via `fs_dax_get(devp, holder, holder_ops)`,
    released via `fs_put_dax(devp, holder)`. The holder gets called back via
    `holder_ops->notify_failure()` when the device reports memory poison.

That is the entire iomap-related vocabulary used in this document.


--------------------------------------------------------------------------------
3. Major kernel data structures
--------------------------------------------------------------------------------

(a) fuse_conn additions (fs/fuse/fuse_i.h):

    struct fuse_conn {
        ...
        unsigned int             famfs_iomap : 1;     /* negotiated at INIT */
        struct rw_semaphore      famfs_devlist_sem;   /* protects dax_devlist */
        struct famfs_dax_devlist *dax_devlist;        /* table of daxdevs */
    };

(b) fuse_inode additions:

    struct fuse_inode {
        ...
        void *famfs_meta;   /* struct famfs_file_meta *, NULL if not famfs */
    };

    A non-NULL `famfs_meta` is the marker for "this is a famfs file";
    `fuse_file_famfs(fi)` is just `READ_ONCE(fi->famfs_meta) != NULL`.

(c) Per-file metadata - struct famfs_file_meta (famfs_kfmap.h):

    +------------------------------------------------------+
    | struct famfs_file_meta                               |
    |   bool error                                         |
    |   enum famfs_file_type    file_type                  |
    |   size_t                  file_size                  |
    |   enum famfs_extent_type  fm_extent_type             |
    |   u64                     dev_bitmap                 |
    |   union {                                            |
    |     SIMPLE:                                          |
    |       size_t fm_nextents                             |
    |       struct famfs_meta_simple_ext *se               |
    |     INTERLEAVED:                                     |
    |       size_t fm_niext                                |
    |       struct famfs_meta_interleaved_ext *ie          |
    |   }                                                  |
    +------------------------------------------------------+

    Simple extent:        (dev_index, ext_offset, ext_len)
    Interleaved extent:   (nstrips, chunk_size, nbytes, strips[])
                          where each strip is a simple extent.

(d) Per-conn dax device table:

    +-------------------------+      +------------------------------+
    | famfs_dax_devlist       |      | famfs_daxdev[MAX_DAXDEVS=24] |
    |   nslots = MAX_DAXDEVS  |----->|                              |
    |   ndevs                 |      |  [0] valid? devp, devno, ... |
    |   devlist *-------------|      |  [1] valid? devp, devno, ... |
    +-------------------------+      |  ...                         |
                                     +------------------------------+

    famfs_daxdev fields:
        valid         - slot has been populated (after wmb)
        error         - dax notify_failure() arrived (poison)
        dax_err       - fs_dax_get() failed; cannot be used
        devno, devp   - dev_t and dax_device pointer
        name          - chrdev pathname for diagnostics


--------------------------------------------------------------------------------
4. Capability negotiation (FUSE INIT)
--------------------------------------------------------------------------------

The wire-level capability is `FUSE_DAX_FMAP` (bit 43 in the 64-bit flags
field, protocol 7.46). Both ends must advertise it in INIT for the kernel
to enable famfs.

    Kernel                                          FUSE server (libfuse)
    ------                                          ---------------------
    fuse_new_init():
      flags |= FUSE_DAX_FMAP                 -- if capable(CAP_SYS_RAWIO)
      ----------- FUSE_INIT (in)  --------->
                                                    libfuse _do_init():
                                                      if (inargflags &
                                                          FUSE_DAX_FMAP)
                                                        conn.capable_ext |=
                                                          FUSE_CAP_DAX_FMAP
                                                    server's init_done CB
                                                    sets:
                                                      conn.want_ext |=
                                                        FUSE_CAP_DAX_FMAP
                                                    libfuse converts that
                                                    back to FUSE_DAX_FMAP
                                                    in outargflags

      <----------- FUSE_INIT (out) ----------

    process_init_reply():
      if reply.flags & FUSE_DAX_FMAP &&
         in.flags also had FUSE_DAX_FMAP:
            famfs_init_devlist_sem(fc)
            fc->famfs_iomap = 1

Both directions must agree. The kernel re-checks the flag it sent in `in.flags`
on the reply path because process_init_reply() does not run in the server's
task context, so capable() cannot be re-evaluated there; the bit in the
outgoing INIT request records "the user that mounted us had CAP_SYS_RAWIO".

    Kernel file:  fs/fuse/inode.c (fuse_new_init, process_init_reply)
    libfuse file: lib/fuse_lowlevel.c (_do_init)
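
The handshake above reduces to three small decisions, sketched here in plain
C. This is a model of the logic only, not the kernel or libfuse code: all
`toy_*` names are hypothetical, and the bit position is the one quoted in this
section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit position as stated in this section; illustrative only. */
#define TOY_FUSE_DAX_FMAP (1ULL << 43)

/* Kernel, INIT request: offer the flag only if the mounter is privileged. */
uint64_t toy_kernel_init_flags(bool has_cap_sys_rawio)
{
    return has_cap_sys_rawio ? TOY_FUSE_DAX_FMAP : 0;
}

/* Server, INIT reply: echo the flag only if it was offered and is wanted. */
uint64_t toy_server_reply_flags(uint64_t in_flags, bool server_wants)
{
    if ((in_flags & TOY_FUSE_DAX_FMAP) && server_wants)
        return TOY_FUSE_DAX_FMAP;
    return 0;
}

/* Kernel, reply path: enable famfs only if both directions carry the bit. */
bool toy_famfs_enabled(uint64_t in_flags, uint64_t reply_flags)
{
    return (in_flags & TOY_FUSE_DAX_FMAP) &&
           (reply_flags & TOY_FUSE_DAX_FMAP);
}
```

Checking `in_flags` again in the last step is what stops a server from
enabling famfs on a mount where the kernel never offered it.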


--------------------------------------------------------------------------------
5. libfuse server-side surface
--------------------------------------------------------------------------------

This is the userspace API a famfs server is expected to implement on top of
libfuse's lowlevel API. From a FUSE-developer point of view this is the
familiar pattern: two new opcodes, two new callbacks in
`struct fuse_lowlevel_ops`, and a new capability bit.

5.1 Capability bit (include/fuse_common.h)

    #define FUSE_CAP_DAX_FMAP   (1UL << 32)

  This sits in the *extended* capability fields `want_ext` / `capable_ext`,
  not the legacy 32-bit `want` / `capable`, because bit 32 is past the end
  of the original word.

5.2 New lowlevel callbacks (include/fuse_lowlevel.h)

    struct fuse_lowlevel_ops {
        ...
        /* Reply: serialized fuse_famfs_fmap_header followed by extents */
        void (*get_fmap)   (fuse_req_t req, fuse_ino_t ino, size_t size);

        /* Reply: serialized fuse_daxdev_out (mainly: char name[256]) */
        void (*get_daxdev) (fuse_req_t req, int daxdev_index);
    };

  Conventional libfuse semantics apply:
      - The callback may reply asynchronously.
      - Valid completions: fuse_reply_buf() with the serialized response,
        or fuse_reply_err(req, errno) on failure.
      - If the server does not provide either op, libfuse replies with
        -EOPNOTSUPP automatically.

5.3 Opcode dispatch (lib/fuse_lowlevel.c)

  Two entries are added to libfuse's opcode dispatch table:

      [FUSE_GET_FMAP]    = { do_get_fmap,   "GET_FMAP"   },
      [FUSE_GET_DAXDEV]  = { do_get_daxdev, "GET_DAXDEV" },

  do_get_fmap:
      reads `inarg` as `struct fuse_getxattr_in *`, extracts `arg->size`
      (the kernel's hint for the maximum reply size it can accept),
      forwards (req, ino, size) to op.get_fmap. The size is currently
      fixed at PAGE_SIZE on the kernel side (FMAP_BUFSIZE); a larger
      variable-size reply protocol is a future TODO.

  do_get_daxdev:
      ignores `inarg`. The kernel encodes the device index in `nodeid`
      (FUSE_GET_DAXDEV uses nodeid as a small integer, not a real inode),
      and libfuse forwards it as `daxdev_index` to op.get_daxdev.

5.4 Wire formats the server must produce

  Defined in include/fuse_kernel.h (libfuse's mirror of the kernel uapi):

      struct fuse_famfs_fmap_header {
          uint8_t  file_type;       /* enum fuse_famfs_file_type */
          uint8_t  reserved;
          uint16_t fmap_version;    /* FAMFS_FMAP_VERSION = 1 */
          uint32_t ext_type;        /* SIMPLE or INTERLEAVE */
          uint32_t nextents;
          uint32_t reserved0;
          uint64_t file_size;
          uint64_t reserved1;
      };

      struct fuse_famfs_simple_ext {
          uint32_t se_devindex;     /* index into the per-mount daxdev table */
          uint32_t reserved;
          uint64_t se_offset;       /* PMD-aligned offset in that daxdev */
          uint64_t se_len;          /* PMD-aligned length */
      };

      struct fuse_famfs_iext {       /* one interleaved extent */
          uint32_t ie_nstrips;
          uint32_t ie_chunk_size;   /* PMD-aligned */
          uint64_t ie_nbytes;       /* total bytes covered by this extent */
          uint64_t reserved;
      };

      struct fuse_daxdev_out {
          uint16_t index;
          uint16_t reserved;
          uint32_t reserved2;
          uint64_t reserved3;
          uint64_t reserved4;
          char     name[256];       /* "/dev/daxN.M" */
      };

  GET_FMAP reply layout in the buffer (fmap_header followed by extents):

       SIMPLE:        [ fmap_header ][ simple_ext * nextents ]
       INTERLEAVE:    [ fmap_header ][ iext, simple_ext*nstrips,
                                       iext, simple_ext*nstrips, ... ]
                      where there are `nextents` (iext + its strips) groups.

  Alignment rules the server MUST honor (else the kernel rejects the fmap):
      * fmap_version == 1
      * 1 <= nextents <= FUSE_FAMFS_MAX_EXTENTS (32)
      * For each strip extent: ext_offset and ext_len PMD-aligned (2 MiB)
      * For interleaved: chunk_size PMD-aligned, nstrips in [1, 32]
      * sum of extent lengths >= file_size

  GET_DAXDEV reply: a single fuse_daxdev_out where `name` is the path of a
  character device that the kernel can `kern_path()` to a devdax inode.
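
  As an illustration of the SIMPLE reply layout and the alignment rules
  above, here is a sketch of how a server might serialize and pre-validate
  an fmap before replying. The struct layouts are copied from this section
  under `toy_` names; the function name is hypothetical, and the numeric
  `file_type` / `ext_type` values are left to the caller because they come
  from enums in fuse_kernel.h that are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PMD_SIZE     (2ULL << 20)  /* 2 MiB, per the rules above */
#define TOY_MAX_EXTENTS  32            /* FUSE_FAMFS_MAX_EXTENTS */
#define TOY_FMAP_VERSION 1             /* FAMFS_FMAP_VERSION */

struct toy_fmap_header {               /* mirrors fuse_famfs_fmap_header */
    uint8_t  file_type;
    uint8_t  reserved;
    uint16_t fmap_version;
    uint32_t ext_type;
    uint32_t nextents;
    uint32_t reserved0;
    uint64_t file_size;
    uint64_t reserved1;
};

struct toy_simple_ext {                /* mirrors fuse_famfs_simple_ext */
    uint32_t se_devindex;
    uint32_t reserved;
    uint64_t se_offset;
    uint64_t se_len;
};

/*
 * Write [ header ][ simple_ext * nextents ] into buf.
 * Returns the number of bytes written, or 0 if a rule is violated or the
 * buffer is too small.
 */
size_t toy_build_simple_fmap(void *buf, size_t bufsize,
                             uint8_t file_type, uint32_t ext_type,
                             uint64_t file_size,
                             const struct toy_simple_ext *ext,
                             uint32_t nextents)
{
    struct toy_fmap_header hdr;
    uint64_t total = 0;
    size_t need;
    uint32_t i;

    if (nextents < 1 || nextents > TOY_MAX_EXTENTS)
        return 0;
    for (i = 0; i < nextents; i++) {
        if (ext[i].se_offset % TOY_PMD_SIZE || ext[i].se_len % TOY_PMD_SIZE)
            return 0;                  /* not PMD-aligned */
        total += ext[i].se_len;
    }
    if (total < file_size)
        return 0;                      /* extents do not cover the file */

    need = sizeof(hdr) + (size_t)nextents * sizeof(*ext);
    if (need > bufsize)
        return 0;

    memset(&hdr, 0, sizeof(hdr));      /* zero all reserved fields */
    hdr.file_type = file_type;
    hdr.fmap_version = TOY_FMAP_VERSION;
    hdr.ext_type = ext_type;
    hdr.nextents = nextents;
    hdr.file_size = file_size;
    memcpy(buf, &hdr, sizeof(hdr));
    memcpy((char *)buf + sizeof(hdr), ext, (size_t)nextents * sizeof(*ext));
    return need;
}
```

  In a real server the resulting buffer would be handed to fuse_reply_buf()
  from the get_fmap callback; validating first lets the server fail with a
  clear errno instead of having the kernel reject the fmap.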

5.5 What the server is responsible for

  In the famfs design, the libfuse-based server still owns:
      * Looking up files in the famfs metadata log (or whatever backend
        userspace uses to track allocations).
      * Producing fmaps that exactly describe the file's allocation.
      * Producing the "/dev/daxN.M" path for each daxdev index it has
        used in any fmap.
      * All conventional FUSE ops: lookup, getattr, mkdir, unlink, etc.

  The server is NOT in the path of any read/write/mmap once the fmap has
  been delivered. There is no equivalent of FUSE_READ / FUSE_WRITE traffic
  for famfs files.


--------------------------------------------------------------------------------
6. Open flow - GET_FMAP and (lazy) GET_DAXDEV
--------------------------------------------------------------------------------

When a regular file is opened on a famfs-enabled connection, the kernel pulls
the file's fmap from the server, parses it, resolves any unknown daxdev
indices via GET_DAXDEV, and installs the result on the inode.

  fuse_open(inode, file)                              [fs/fuse/file.c]
    |
    +-- fuse_do_open()                                (regular FUSE open)
    |
    +-- if (fc->famfs_iomap && S_ISREG)
    |     fuse_get_fmap(fm, inode)                    [famfs.c]
    |         |
    |         +-- alloc fmap_buf (PAGE_SIZE)
    |         |
    |         +-- args.opcode = FUSE_GET_FMAP
    |         |   args.nodeid = ino
    |         |   args.out_argvar = true             (variable-size reply)
    |         +-- fuse_simple_request(fm, &args)  ----> server returns
    |         |                                        fuse_famfs_fmap_header
    |         |                                        + extents
    |         +-- famfs_file_init_dax(fm, inode, fmap_buf, fmap_size)
    |               |
    |               +-- famfs_fuse_meta_alloc()
    |               |     parses header + extents into struct famfs_file_meta;
    |               |     accumulates meta->dev_bitmap of referenced devindices;
    |               |     validates PMD alignment + total size >= file_size;
    |               |     cmpxchg-installs *metap (race-safe)
    |               |
    |               +-- famfs_update_daxdev_table(fm, meta)
    |               |     if (!fc->dax_devlist) cmpxchg-allocate it
    |               |     under famfs_devlist_sem (read):
    |               |        collect indices that are NOT yet ->valid
    |               |     drop lock, then for each index:
    |               |        famfs_fuse_get_daxdev(fm, idx)        <see below>
    |               |
    |               +-- inode_lock(inode)
    |               |   famfs_meta_set(fi, meta)       (cmpxchg, NULL=>meta)
    |               |   if installed: i_size_write, S_DAX, a_ops=famfs_dax_aops
    |               |   inode_unlock(inode)
    |
    +-- fuse_finish_open(inode, file)
    +-- skip page cache invalidation if fuse_file_famfs(fi)


GET_DAXDEV per-index flow:

  famfs_fuse_get_daxdev(fm, index):
    args.opcode = FUSE_GET_DAXDEV
    args.nodeid = index
    fuse_simple_request()  -----> server returns
                                  fuse_daxdev_out{.name = "/dev/daxX.Y"}

    under famfs_devlist_sem (write):
      if dd->valid: return                         /* lost race; OK */
      famfs_verify_daxdev(name, &dd->devno):
          kern_path() + d_backing_inode() + S_ISCHR
          may_open_dev()                           /* exported in fs/namei.c */
          dd->devno = inode->i_rdev
      dd->name = kstrdup(name)
      dd->devp = dax_dev_get(devno)
      fs_dax_get(devp, fc, &famfs_fuse_dax_holder_ops)
          on failure: dd->dax_err = 1              /* still mark valid */
      wmb()
      dd->valid = 1
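
The "accumulate dev_bitmap, then collect the indices that are not yet valid"
step of the open flow can be sketched in userspace C as below. The function
names are hypothetical and locking is omitted; the sketch only shows the
bitmap bookkeeping that decides which GET_DAXDEV requests are needed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_DAXDEVS 24   /* MAX_DAXDEVS as given in section 3 */

/* Build a bitmap of every device index referenced by a file's extents. */
int toy_dev_bitmap(const uint32_t *dev_index, size_t n, uint64_t *bitmap)
{
    size_t i;

    *bitmap = 0;
    for (i = 0; i < n; i++) {
        if (dev_index[i] >= TOY_MAX_DAXDEVS)
            return -EINVAL;            /* index outside the device table */
        *bitmap |= 1ULL << dev_index[i];
    }
    return 0;
}

/*
 * Collect referenced indices whose table slot is not yet valid.
 * `missing` must have room for TOY_MAX_DAXDEVS entries.
 * Returns the number of indices that still need a GET_DAXDEV round trip.
 */
size_t toy_missing_devs(uint64_t bitmap, const bool *valid, uint32_t *missing)
{
    size_t n = 0;
    uint32_t idx;

    for (idx = 0; idx < TOY_MAX_DAXDEVS; idx++)
        if ((bitmap & (1ULL << idx)) && !valid[idx])
            missing[n++] = idx;
    return n;
}
```

Because the table is per connection, a device resolved for one file is
already valid when the next file referencing it is opened, so GET_DAXDEV
traffic is limited to the first use of each index.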


--------------------------------------------------------------------------------
7. iomap interaction - the central design point
--------------------------------------------------------------------------------

famfs implements only `.iomap_begin`. There is no `.iomap_end` because there
is no allocation, dirty tracking, or completion bookkeeping.

    const struct iomap_ops famfs_iomap_ops = {
        .iomap_begin = famfs_fuse_iomap_begin,
    };

The dax core (fs/dax.c) calls into famfs_iomap_ops from two entry points:

    dax_iomap_rw(iocb, iter, ops)        -- read_iter / write_iter
    dax_iomap_fault(vmf, order, ...)     -- mmap PTE/PMD/PUD faults

For the iomap concepts used here, see the primer in section 2.

7.1 read/write path

  fuse_file_read_iter(iocb, to)                 [fs/fuse/file.c]
    if (fuse_file_famfs(fi))
        return famfs_fuse_read_iter(iocb, to);  [famfs.c]

  famfs_fuse_read_iter:
    famfs_fuse_rw_prep(iocb, to):
        famfs_file_bad(inode)?                    -> -EIO/-ENXIO
        truncate iter to (i_size - ki_pos)
    dax_iomap_rw(iocb, to, &famfs_iomap_ops)  ===>
                                                  +-- repeatedly:
                                                  |     iomap_begin(...)
                                                  |     memcpy_from_pmem/to
                                                  |     advance position
                                                  +-- returns bytes copied

  famfs_fuse_write_iter is symmetric (no FOPEN_DIRECT_IO / passthrough fork;
  splice paths return -EIO since famfs has no page cache).

7.2 mmap path

  fuse_file_mmap(file, vma)                     [fs/fuse/file.c]
    if (fuse_file_famfs(fi))
        return famfs_fuse_mmap(file, vma);

  famfs_fuse_mmap:
    famfs_file_bad(inode)
    vma->vm_ops = &famfs_file_vm_ops
    vm_flags_set(vma, VM_HUGEPAGE)              /* prefer 2MiB faults */

  famfs_file_vm_ops:
    .fault         = famfs_filemap_fault         (PTE)
    .huge_fault    = famfs_filemap_huge_fault    (PMD/PUD)
    .map_pages     = filemap_map_pages
    .page_mkwrite  = famfs_filemap_mkwrite
    .pfn_mkwrite   = famfs_filemap_mkwrite

7.3 fault handler dispatch

  __famfs_fuse_filemap_fault(vmf, pe_size, write_fault):
    if (!IS_DAX(inode)) return SIGBUS
    if (write_fault) sb_start_pagefault, file_update_time
    ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops)
        |
        +-- internally calls famfs_fuse_iomap_begin to learn (dax_dev,
            offset, length), then maps the resolved PFN into the VMA.
    if (ret & VM_FAULT_NEEDDSYNC) ret = dax_finish_sync_fault(...)

7.4 iomap_begin - the resolver

  famfs_fuse_iomap_begin(inode, offset, length, flags, iomap, srcmap)
    meta = fi->famfs_meta
    WARN_ON(i_size != meta->file_size)
    return famfs_fileofs_to_daxofs(inode, iomap, offset, length, flags)

  famfs_fileofs_to_daxofs (SIMPLE case):
      validate dax_devlist + famfs_file_bad
      walk meta->se[0..fm_nextents-1]:
          if local_offset < se[i].ext_len:
              dd = devlist[se[i].dev_index]
              famfs_dax_err(dd) -> if errored, set meta->error and return
              iomap->addr   = se[i].ext_offset + local_offset
              iomap->offset = file_offset
              iomap->length = min(len, ext_len - local_offset)
              iomap->dax_dev= dd->devp
              iomap->type   = IOMAP_MAPPED
              return 0
          local_offset -= se[i].ext_len
      fall-through: zero-length iomap, return -EIO

  famfs_fileofs_to_daxofs delegates to famfs_interleave_fileofs_to_daxofs for
  INTERLEAVED_EXTENT (see section 8).
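
  The SIMPLE walk above translates almost line for line into standalone C.
  The sketch below uses hypothetical `toy_` types, returns the device index
  instead of a dax_device pointer, and leaves out the devlist and error
  checks; it is meant only to make the offset arithmetic concrete.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct toy_simple_ext {
    uint32_t dev_index;
    uint64_t ext_offset;   /* byte offset within the dax device */
    uint64_t ext_len;
};

struct toy_iomap {
    uint32_t dev_index;    /* stands in for iomap->dax_dev */
    uint64_t addr;
    uint64_t offset;
    uint64_t length;
};

/* Resolve (file_offset, len) to the extent that contains file_offset. */
int toy_fileofs_to_daxofs(const struct toy_simple_ext *se, size_t nextents,
                          uint64_t file_offset, uint64_t len,
                          struct toy_iomap *out)
{
    uint64_t local_offset = file_offset;
    size_t i;

    for (i = 0; i < nextents; i++) {
        if (local_offset < se[i].ext_len) {
            uint64_t remain = se[i].ext_len - local_offset;

            out->dev_index = se[i].dev_index;
            out->addr = se[i].ext_offset + local_offset;
            out->offset = file_offset;
            /* Never map past the end of this extent. */
            out->length = len < remain ? len : remain;
            return 0;
        }
        local_offset -= se[i].ext_len;   /* skip this extent */
    }
    out->length = 0;                      /* offset beyond all extents */
    return -EIO;
}
```

  Capping `length` at the end of the extent is what makes the dax core call
  back for the next extent, so one read can span several devices.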

7.5 The full iomap call graph

    user process                fs/fuse/famfs.c             fs/dax.c
    ------------                ---------------             --------
    read(2)/write(2)
        |
        v
    fuse_file_read_iter / write_iter (file.c)
        |
        +--> famfs_fuse_{read,write}_iter
                |
                +--> famfs_fuse_rw_prep (sanity, truncate to i_size)
                |
                +--> dax_iomap_rw --> iter loop
                                          |
                                          v
                                      iomap_iter()
                                          |
                                          +--> .iomap_begin  <-----------+
                                          |     famfs_fuse_iomap_begin   |
                                          |      famfs_fileofs_to_daxofs |
                                          |       [+ interleave variant] |
                                          |        fc->dax_devlist[idx]  |
                                          |        dd->devp / ext_offset |
                                          |                              |
                                          +--> dax_iomap_iter() ---------+
                                               memcpy via dax_direct_access
                                               on iomap->dax_dev / iomap->addr

    page fault on mmap region
        |
        v
    .fault / .huge_fault (famfs_file_vm_ops)
        |
        +--> __famfs_fuse_filemap_fault
                |
                +--> dax_iomap_fault(vmf, order, ..., &famfs_iomap_ops)
                        |
                        +--> .iomap_begin
                        |      famfs_fuse_iomap_begin
                        |        (resolves dax_dev + offset)
                        +--> dax_insert_pfn / vmf_insert_pfn_pmd


--------------------------------------------------------------------------------
8. Interleaved (striped) extents
--------------------------------------------------------------------------------

An interleaved extent stripes a contiguous logical region across N strips on
N (typically distinct) dax devices, in fixed-size chunks.

    ie_nstrips      = N
    ie_chunk_size   = C  (must be PMD-aligned)
    ie_nbytes       = total logical bytes covered

  Logical layout (N=4):

    file offset: [0      C][C     2C][2C    3C][3C    4C][4C ...
    strip:       | strip 0 | strip 1 | strip 2 | strip 3 | strip 0 ...
    stripe:      | stripe 0| stripe 0| stripe 0| stripe 0| stripe 1 ...

  Resolution arithmetic in famfs_interleave_fileofs_to_daxofs():

    chunk_num       = local_offset / chunk_size
    chunk_offset    = local_offset % chunk_size
    chunk_remainder = chunk_size - chunk_offset
    stripe_num      = chunk_num / nstrips
    strip_num       = chunk_num % nstrips
    strip_offset    = chunk_offset + stripe_num * chunk_size

    iomap->addr     = ie_strips[strip_num].ext_offset + strip_offset
    iomap->dax_dev  = devlist[ie_strips[strip_num].dev_index].devp
    iomap->length   = min(len, chunk_remainder)
    iomap->type     = IOMAP_MAPPED

The length is capped at chunk_remainder so the next iomap iteration steps to
the next chunk (which usually lives on a different device).
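
The arithmetic above can be checked with a small standalone C function. The
types and the function name are hypothetical; the variable names and formulas
are the ones listed in this section.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MIB (1ULL << 20)

struct toy_strip {
    uint32_t dev_index;
    uint64_t ext_offset;   /* start of this strip on its dax device */
};

struct toy_iomap {
    uint32_t dev_index;    /* stands in for iomap->dax_dev */
    uint64_t addr;
    uint64_t length;
};

/* Resolve an offset inside one interleaved extent to (device, address). */
void toy_interleave_resolve(const struct toy_strip *strips, uint64_t nstrips,
                            uint64_t chunk_size, uint64_t local_offset,
                            uint64_t len, struct toy_iomap *out)
{
    uint64_t chunk_num       = local_offset / chunk_size;
    uint64_t chunk_offset    = local_offset % chunk_size;
    uint64_t chunk_remainder = chunk_size - chunk_offset;
    uint64_t stripe_num      = chunk_num / nstrips;
    uint64_t strip_num       = chunk_num % nstrips;
    uint64_t strip_offset    = chunk_offset + stripe_num * chunk_size;

    out->dev_index = strips[strip_num].dev_index;
    out->addr = strips[strip_num].ext_offset + strip_offset;
    /* Stop at the chunk boundary; the next chunk is on another strip. */
    out->length = len < chunk_remainder ? len : chunk_remainder;
}
```

Worked example with N=4 and C=2 MiB: local offset 9 MiB + 4096 falls in
chunk 4, which is stripe 1 on strip 0, so the strip offset is
(1 MiB + 4096) + 1 * 2 MiB = 3 MiB + 4096, and at most 1 MiB - 4096 bytes
can be mapped before the chunk ends.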


--------------------------------------------------------------------------------
9. Memory-error / failure handling
--------------------------------------------------------------------------------

A famfs file becomes unusable when any of the conditions below is true. They
are checked on every read/write/fault by famfs_file_bad() and famfs_dax_err().

  Source of error            Effect               Surface
  ---------------            ------               -------
  fs_dax_get() fails         dd->dax_err = 1      famfs_dax_err  -> -EIO
  notify_failure() upcall    dd->error   = true   famfs_dax_err  -> -EHWPOISON
  i_size != meta->file_size  meta->error = true   famfs_file_bad -> -ENXIO
  IS_DAX(inode) cleared      (size change, etc.)  famfs_file_bad -> -ENXIO

  notify_failure() flow:

    devdax layer detects poison / reconfig
        |
        v
    dax_holder_ops->notify_failure(dax_devp, offset, len, mf_flags)
        = famfs_dax_notify_failure
            fc = dax_holder(dax_devp)
            famfs_set_daxdev_err(fc, dax_devp):
                under famfs_devlist_sem (write):
                    find slot whose dd->devp == dax_devp
                    dd->error = true
                    pr_err

  On the next iomap_begin, famfs_dax_err sees dd->error and returns -EHWPOISON;
  meta->error is also set on that file so subsequent accesses short-circuit
  via famfs_file_bad without touching dax.
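
  A rough model of these checks in plain C follows. The errno values per
  condition are taken from the table above; the order of the checks and the
  errno returned once meta->error is already set (-EIO here) are assumptions
  of this sketch, and all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#ifndef EHWPOISON
#define EHWPOISON 133   /* generic Linux value; fallback for other libcs */
#endif

struct toy_daxdev {
    bool dax_err;       /* fs_dax_get() failed */
    bool error;         /* notify_failure() arrived */
};

struct toy_meta {
    bool error;         /* sticky per-file error */
    uint64_t file_size;
};

/* Device-level check, as in the first two table rows. */
int toy_dax_err(const struct toy_daxdev *dd)
{
    if (dd->dax_err)
        return -EIO;
    if (dd->error)
        return -EHWPOISON;
    return 0;
}

/* File-level check, as in the last two table rows. */
int toy_file_bad(struct toy_meta *meta, uint64_t i_size, bool is_dax)
{
    if (meta->error)
        return -EIO;    /* assumption: errno for an already-errored file */
    if (i_size != meta->file_size) {
        meta->error = true;
        return -ENXIO;
    }
    if (!is_dax)
        return -ENXIO;
    return 0;
}

/* Resolver-side check: a device error also marks the file as errored. */
int toy_resolve_check(struct toy_meta *meta, const struct toy_daxdev *dd)
{
    int err = toy_dax_err(dd);

    if (err)
        meta->error = true;
    return err;
}
```

  The point of the sticky flag is that the first access after a poison
  event pays for the device lookup, and every later access fails early in
  the file-level check.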


--------------------------------------------------------------------------------
10. Lifetime / teardown
--------------------------------------------------------------------------------

Per-inode:

  fuse_alloc_inode (inode.c)
      famfs_meta_set(fi, NULL)                (init)
  fuse_free_inode (inode.c)
      if (S_ISREG && fuse_file_famfs(fi))
          famfs_meta_free(fi)
              -> __famfs_meta_free: frees se/ie arrays + struct
  fuse_evict_inode
      if (FUSE_IS_VIRTIO_DAX || fuse_file_famfs)
          dax_break_layout_final(inode)       (stop ongoing dax mappings)

Per-connection:

  fuse_conn_put -> famfs_teardown(fc):
      for each valid slot:
          if dd->devp:
              if (!dd->dax_err) fs_put_dax(dd->devp, fc)   /* drop holder */
              put_dax(dd->devp)
          kfree(dd->name)
      kfree(devlist->devlist)
      kfree(devlist)


--------------------------------------------------------------------------------
11. Concurrency model
--------------------------------------------------------------------------------

  fc->famfs_devlist_sem  (rw_semaphore)
      readers : iomap_begin paths reading devlist[idx]
                famfs_update_daxdev_table while collecting "missing" indices
      writers : famfs_fuse_get_daxdev (populating a slot)
                famfs_set_daxdev_err  (notify_failure)

  cmpxchg pairs (NULL -> ptr installation, race-tolerant):
      fc->dax_devlist               (first-time allocation)
      fi->famfs_meta                (first GET_FMAP wins, others freed)

  wmb() before dd->valid=1 ensures readers that observe `valid` see fully
  initialized name/devp/devno fields.
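
  Both patterns have direct userspace analogues in C11 atomics, sketched
  below. This is not the kernel code: compare-exchange stands in for
  cmpxchg(), a release store and acquire load stand in for the wmb() and the
  matching read-side ordering, and all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_meta {
    int id;
};

/*
 * Install `meta` only if the slot is still NULL. The loser of a race frees
 * its own copy and uses the winner's. Returns the installed pointer.
 */
struct toy_meta *toy_meta_set(_Atomic(struct toy_meta *) *slot,
                              struct toy_meta *meta)
{
    struct toy_meta *expected = NULL;

    if (atomic_compare_exchange_strong(slot, &expected, meta))
        return meta;        /* we won: slot was NULL */
    free(meta);             /* we lost: discard our copy */
    return expected;        /* the value someone else installed */
}

struct toy_daxdev {
    int devno;
    atomic_bool valid;
};

/* Writer: fill in the fields first, then publish with release semantics. */
void toy_publish(struct toy_daxdev *dd, int devno)
{
    dd->devno = devno;
    atomic_store_explicit(&dd->valid, true, memory_order_release);
}

/* Reader: only trust the fields after observing valid with acquire. */
bool toy_lookup(struct toy_daxdev *dd, int *devno)
{
    if (!atomic_load_explicit(&dd->valid, memory_order_acquire))
        return false;
    *devno = dd->devno;
    return true;
}
```

  In both cases a late arrival never blocks: it either adopts the pointer
  that is already installed or sees a fully initialized slot.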


--------------------------------------------------------------------------------
12. End-to-end timeline (read on a freshly opened famfs file)
--------------------------------------------------------------------------------

The "fuse server" column is whatever process is using libfuse (with
op.get_fmap / op.get_daxdev populated as in section 5).

  app                   fuse/famfs (kernel)              fuse server
  ---                   -------------------              -----------
  open("/mnt/famfs/x")                                      |
       ----- VFS open -----> fuse_open                      |
                            fuse_do_open ----- OPEN -----> handles
                                          <---- ok --------|
                            fuse_get_fmap                   |
                                          - GET_FMAP ----->|
                                          <- fmap reply ---|
                            famfs_fuse_meta_alloc           |
                            famfs_update_daxdev_table       |
                              [new device idx]              |
                                          - GET_DAXDEV --->|
                                          <- daxdev reply -|
                              dax_dev_get + fs_dax_get      |
                              dd->valid = 1                 |
                            famfs_meta_set(fi, meta)        |
                            inode->i_flags |= S_DAX         |
                            i_data.a_ops = famfs_dax_aops   |
       <----- fd ------------                               |
  read(fd, buf, len)                                        |
       ----- VFS read ----> fuse_file_read_iter             |
                            famfs_fuse_read_iter            |
                              famfs_fuse_rw_prep            |
                              dax_iomap_rw                  |
                                .iomap_begin ->             |
                                  famfs_fuse_iomap_begin    |
                                  famfs_fileofs_to_daxofs   |
                                    -> iomap{dax_dev, addr} |
                                memcpy from dax memory      |
       <---- bytes ----------                               |
                                                            |   <-- no upcall on fast path
  close(fd)
       ----- VFS release -> fuse_release ----- RELEASE ---> |
                                          <---- ok --------|
  unmount
       ----- umount ------> fuse_conn_put                   |
                            famfs_teardown                  |
                              fs_put_dax / put_dax all dd's |


--------------------------------------------------------------------------------
13. What deliberately is NOT in the kernel
--------------------------------------------------------------------------------

  * Allocation and metadata mutation: handled in userspace; the kernel only
    consumes fmaps as opaque-but-versioned blobs.
  * Page cache and writeback: not used; the only hook famfs_dax_aops provides
    is noop_dirty_folio.
  * Truncate / append: any size change marks the file errored; recovery is a
    userspace responsibility (typically: re-replay the famfs metadata log).
  * fallocate / hole handling: files are never sparse and never have holes,
    so iomap_begin only ever returns IOMAP_MAPPED (or zero-length on EOF).
  * io-modes (FUSE_OPEN_*): bypassed for famfs files in iomode.c since
    everything is direct-to-dax.
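
The "never sparse, never has holes" property above is what keeps the
offset resolver simple. A minimal sketch for a list of simple (non-interleaved)
extents, assuming a hypothetical `struct simple_ext` layout and a hypothetical
resolve_offset() helper; the real structures live in famfs_kfmap.h and are not
reproduced here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simple extent: one contiguous range on one dax device. */
struct simple_ext {
    uint32_t dev_index;     /* index into the daxdev table */
    uint64_t dax_offset;    /* start of the range on that device */
    uint64_t len;           /* length of the range in bytes */
};

/* Result of a lookup; len == 0 means "at or past EOF". */
struct mapping {
    uint32_t dev_index;
    uint64_t dax_offset;
    uint64_t len;
};

/*
 * Map a file offset to (device, device offset, contiguous length).
 * Extents are laid end to end in file order, so there is no hole case:
 * the answer is either a mapped range or a zero-length result.
 */
struct mapping resolve_offset(const struct simple_ext *exts, size_t n_exts,
                              uint64_t file_size, uint64_t file_ofs)
{
    struct mapping m = {0, 0, 0};
    uint64_t ext_start = 0;     /* file offset where exts[i] begins */

    assert(exts != NULL || n_exts == 0);
    if (file_ofs >= file_size)
        return m;               /* EOF: zero-length mapping */

    for (size_t i = 0; i < n_exts; i++) {
        if (file_ofs < ext_start + exts[i].len) {
            uint64_t within = file_ofs - ext_start;
            uint64_t avail = exts[i].len - within;
            uint64_t to_eof = file_size - file_ofs;

            m.dev_index = exts[i].dev_index;
            m.dax_offset = exts[i].dax_offset + within;
            m.len = avail < to_eof ? avail : to_eof;
            return m;
        }
        ext_start += exts[i].len;
    }
    return m;   /* extents shorter than file_size: report nothing mapped */
}
```

The returned length is clamped both to the end of the extent and to EOF, so a
caller would loop and call the resolver again for a request that spans two
extents. Interleaved extents (Section 8) are not modelled in this sketch.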


--------------------------------------------------------------------------------
14. Commit-by-commit map back to this design
--------------------------------------------------------------------------------

14.1 Kernel (linux.git, branch famfs)

  ac071fbd94a6  Basic fuse kernel ABI            -> Section 4 (negotiation),
                                                    CONFIG_FUSE_FAMFS_DAX,
                                                    fc->famfs_iomap bit
  9a06500c1e0f  Plumb GET_FMAP message/response  -> Section 6 (fuse_get_fmap,
                                                    fuse_open hook)
  6f4e03a4e8e9  Create files with famfs fmaps    -> Section 3 (famfs_file_meta),
                                                    Section 6
                                                    (famfs_file_init_dax,
                                                    famfs_fuse_meta_alloc)
  dfc9e12bcb99  GET_DAXDEV msg + daxdev_table    -> Section 3
                                                    (famfs_dax_devlist),
                                                    Section 6.GET_DAXDEV,
                                                    famfs_teardown
  d79f803dbfd1  Plumb dax iomap + r/w/mmap       -> Section 7 (iomap_ops, fault,
                                                    rw paths) and Section 8
                                                    (interleave resolver)
  8731eb03c762  holder_ops for notify_failure()  -> Section 9 (memory errors)
  6ea21f89b361  DAX address_space_operations     -> Section 1 / famfs_dax_aops
  fae4d807da34  fmap metadata documentation      -> kernel header comment in
                                                    famfs_kfmap.h (Section 8)
  da9edf77cbc4  Documentation/filesystems/famfs  -> user-facing docs

14.2 libfuse (libfuse.git, branch famfs)

  d75ae2ee  fuse_kernel.h: bring up to baseline 6.19
                Mechanical sync of include/fuse_kernel.h with the kernel uapi
                up to 7.45 (everything BEFORE famfs). No new functionality;
                this is the baseline the famfs commits build on.

  e87be376  fuse_kernel.h: add famfs DAX fmap protocol definitions
                Adds protocol 7.46:
                  * FUSE_DAX_FMAP capability bit
                  * FUSE_GET_FMAP / FUSE_GET_DAXDEV opcodes
                  * struct fuse_famfs_fmap_header / simple_ext / iext
                  * struct fuse_get_daxdev_in / fuse_daxdev_out
                  * enum fuse_famfs_file_type / famfs_ext_type
                Pure header; mirrors include/uapi/linux/fuse.h on the kernel.
                -> Section 5.4 (wire formats).

  0b16c7d8  fuse: add famfs DAX fmap support
                Wires the protocol into the libfuse lowlevel API:
                  * include/fuse_common.h:   FUSE_CAP_DAX_FMAP (1UL << 32)
                  * include/fuse_lowlevel.h: op.get_fmap, op.get_daxdev
                  * lib/fuse_lowlevel.c:
                      - INIT: capable_ext / want_ext <-> FUSE_DAX_FMAP
                      - dispatch table entries for GET_FMAP / GET_DAXDEV
                      - do_get_fmap / do_get_daxdev forward to op callbacks
                -> Section 4 (capability), Section 5.1-5.3 (libfuse API),
                   Section 5.5 (server responsibilities).

  d1e6135c  build(deps): bump github/codeql-action ...     (CI; not relevant)
  fa03307c  doc: replace "futur irrealis"-like tense ...   (man pages; not relevant)
  9c65d781  Merge branch 'master' into famfs-6.19          (merge commit)
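
The capable_ext / want_ext handshake mentioned for commit 0b16c7d8 can be
sketched as follows. This is a simplified model, not libfuse code:
`struct conn_caps` and request_cap() are hypothetical stand-ins for the
corresponding fields of the connection info and for whatever helper the
server uses, and the bit value is taken from the commit description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bit value as listed for commit 0b16c7d8; written with UINT64_C so the
 * shift is well defined on targets where long is 32 bits.
 */
#define CAP_DAX_FMAP (UINT64_C(1) << 32)

/* Hypothetical stand-in for the capability fields of the connection info. */
struct conn_caps {
    uint64_t capable_ext;   /* what the kernel offered in INIT */
    uint64_t want_ext;      /* what the server requests in its INIT reply */
};

/*
 * Request a capability only if the kernel offered it. Returns false and
 * leaves want_ext untouched when the kernel does not support the feature.
 */
bool request_cap(struct conn_caps *c, uint64_t cap)
{
    assert(c != NULL);
    if (!(c->capable_ext & cap))
        return false;
    c->want_ext |= cap;
    return true;
}
```

A server would typically call something like this from its init callback and
fall back to (or refuse) non-DAX operation when the call returns false.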

================================================================================
End of document.
================================================================================
