Hi Christoph,

Sorry I didn't phrase things clearly earlier, but I'd still
like to explain the whole idea, as this feature is clearly
useful for containerization.  I hope we can reach agreement
on the page cache sharing feature; Christian agreed to this
feature earlier (and I hope he still does):

https://lore.kernel.org/linux-fsdevel/20260112-begreifbar-hasten-da396ac2759b@brauner

First, let's separate this feature from mounting in user
namespaces (i.e., unprivileged mounts), because this feature
is designed specifically for privileged mounts.

The EROFS page cache sharing feature stems from a current
limitation in the page cache: a file-based folio cannot be
shared across different inode mappings (or across different
page indices within the same mapping; if this limitation
were resolved, we could implement a finer-grained page
cache sharing mechanism at the folio level).  As you may
know, this patchset dates back to 2023, and as of 2026 I
still see no indication that the page cache infrastructure
will change.
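
As a concrete symptom (the paths below are purely
illustrative): identical file content mounted from two
different images today occupies two separate page cache
copies, which fincore(1) from util-linux can show, since it
reports cache residency per inode:

    fincore /mnt/app1/lib/libc.so.6 /mnt/app2/lib/libc.so.6
    # Each path is a distinct inode with its own mapping, so each
    # holds its own resident pages even though the bytes match.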

So let's face reality: this feature introduces on-disk
xattrs called "fingerprints".  Since they're just xattrs,
the EROFS on-disk format remains unchanged.

A new compat feature bit in the superblock indicates
whether an EROFS image contains such xattrs.
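
For illustration only (the xattr name and output below are my
assumptions, not a final interface): such a fingerprint would
appear as an ordinary per-inode xattr, and the compat bit in
the superblock dump from erofs-utils:

    dump.erofs -s image.erofs     # superblock info, incl. feature bits
    getfattr -d -m - /mnt/erofs/bin/busybox
    # e.g. might list a hypothetical "trusted.erofs.fingerprint" xattr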

=====
In short: no on-disk format changes are required for
page cache sharing -- only xattrs attached to inodes
in the EROFS image.

Even if finer-grained page cache sharing is implemented
many years from now, existing images will remain
compatible, since those xattrs can simply be ignored.
=====

At runtime, the feature is explicitly enabled via a new
mount option: `inode_share`, which is intended only for
privileged mounters. A `domain_id` must also be specified
to define a trusted domain. This means:

 - For regular EROFS mounts (without `inode_share`;
   default), no page cache sharing happens for those
   images;

 - For mounts with `inode_share`, page cache sharing is
   allowed only among mounts with the same `domain_id`.
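
A minimal sketch of the intended usage (the option names
follow this proposal; the image names are illustrative):

    mount -t erofs -o inode_share,domain_id=ctr app1.erofs /mnt/app1
    mount -t erofs -o inode_share,domain_id=ctr app2.erofs /mnt/app2
    # Inodes carrying identical fingerprints in the two images may
    # now share page cache, as both mounts join the same domain "ctr".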

The `domain_id` can be thought of as defining a federated
super-filesystem: the data behind a unique "fingerprint"
(e.g., a secure hash or UUID) may come from any of the
participating filesystems, but only one page cache copy of
it exists.

EROFS is an immutable, image-based golden filesystem: its
(meta)data is generated entirely in userspace.  I consider it
a special class of disk filesystem, so traditional assumptions
about generic read-write filesystems don't always apply; an
image filesystem (especially one for containers) can also have
unique features, driven by its image use cases, that typical
local filesystems don't.

As for unprivileged mounts, that is another story (clearly
there are different features, at least at runtime).  First, I
think no one disputes that mounting in user namespaces is
useful for containers; I do agree it should have a formal
written threat model in advance.  While I'm not a security
expert per se, I'll draft one later separately.

My rough thoughts are:

 - Let's not focus entirely on random human bugs, because
   every practical subsystem has bugs; the whole threat model
   should focus on the system design, and less code by itself
   doesn't mean much (it can still be buggy or even carry a
   system design flaw);

 - EROFS only accesses the (meta)data from the source blobs
   specified at mount time, even with multi-device support:

    mount -t erofs -odevice=[blob],device=[blob],... [source]

   An EROFS mount instance never accesses data beyond those
   blobs.  Moreover, EROFS holds reference counts on these
   blobs for the entire lifetime of the mounted filesystem
   (so even if a blob is deleted, it remains accessible as an
   orphan/deleted inode).

 - As a strictly immutable filesystem, EROFS never writes to
   its underlying blobs/devices, so by design it avoids the
   complicated space allocation, deallocation, reverse mapping,
   and journaling/writeback consistency issues of writable
   filesystems like ext4, XFS, or BTRFS.  That is not to say,
   however, that EROFS cannot withstand arbitrary (meta)data
   changes made to the blobs directly by external users.

 - External users can modify the underlying blobs/devices only
   when they have permission on those blobs/devices, so there
   is no privilege escalation risk; I therefore think "sneaking
   in unexpected data" isn't meaningful here -- you need proper
   permissions to alter the source blobs.

   So the only question is whether EROFS's on-disk design can
   safely handle arbitrary (even fuzzed) external modifications.
   I believe it can, because EROFS doesn't keep any redundant
   metadata (in particular for space allocation, reverse mapping,
   and journaling) the way EXT4, XFS, and BTRFS do.

   Thus, it avoids the kinds of severe inconsistency bugs seen
   in generic read-write filesystems; if you claim corruption
   or inconsistency, you should first define what that
   corruption is.  Almost all severe inconsistency issues
   simply cannot arise as inconsistencies in the EROFS on-disk
   design itself; also see:
   https://erofs.docs.kernel.org/en/latest/imagefs.html
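
   For instance (a sketch; the image name is illustrative),
   a possibly-tampered image can be checked entirely in
   userspace with the fsck tool from erofs-utils, which walks
   the whole tree and verifies that all files decode cleanly:

     fsck.erofs --extract image.erofs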

 - Of course, unprivileged kernel EROFS mounts should start
   from a minimal core on-disk format, typically the following:
   https://erofs.docs.kernel.org/en/latest/core_ondisk.html

   I'll clarify this together with the full security model
   later if this feature really gets developed;

 - In the end, I don't think various wild non-technical
   assumptions make any sense for working out the correct
   design of unprivileged mounts; if a real security threat
   exists, it should first have a potential attack path
   written down (even just in theory), but I can't identify
   any practical one based on the design I have in mind.

All in all, I'm open to hearing and discussing any potential
threat or valid argument and to finding the final answers,
but I do think we should keep the discussion technical rather
than purely about policy, as in the previous related threads.

Thanks,
Gao Xiang
