On 2026/1/5 02:44, Amir Goldstein wrote:
On Sun, Jan 4, 2026 at 11:42 AM Gao Xiang <[email protected]> wrote:



On 2026/1/4 18:01, Amir Goldstein wrote:
[+fsdevel][+overlayfs]

On Sun, Jan 4, 2026 at 4:56 AM Gao Xiang <[email protected]> wrote:

Hi Amir,

On 2026/1/1 23:52, Amir Goldstein wrote:
On Wed, Dec 31, 2025 at 9:42 PM Gao Xiang <[email protected]> wrote:

Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
stack overflow, but it breaks composefs mounts, which sometimes need
erofs+ovl^2 (and such setups have already been used in production for
quite a long time) since `s_stack_depth` can then reach 3 (i.e.,
FILESYSTEM_MAX_STACK_DEPTH would need to change from 2 to 3).

After a long discussion on GitHub issues [1] about possible solutions,
one conclusion is that there seems to be no need to support nesting
file-backed mounts (especially when increasing
FILESYSTEM_MAX_STACK_DEPTH to 3).  So let's disallow this right now,
since there is always a way to use loopback devices as a fallback.

Then, I started to wonder about an alternative EROFS quick fix to
address the composefs mounts directly for this cycle: since EROFS is
the only fs that supports file-backed mounts and other stacked fses
would just bump up `FILESYSTEM_MAX_STACK_DEPTH`, just check that
`s_stack_depth` != 0 and that the backing inode is not from EROFS
instead.
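
One plausible reading of that check (an illustrative sketch, not the
actual patch; the helper name and call site are hypothetical):

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/magic.h>

/*
 * Reject a backing file that lives on an already-stacked filesystem
 * (s_stack_depth != 0) or on another EROFS instance, so that
 * file-backed mounts cannot nest.
 */
static int erofs_check_backing_inode(struct inode *backing)
{
	struct super_block *sb = backing->i_sb;

	if (sb->s_stack_depth || sb->s_magic == EROFS_SUPER_MAGIC_V1)
		return -EOPNOTSUPP;
	return 0;
}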

At least it works for all known file-backed mount use cases (composefs,
containerd, and Android APEX for some Android vendors), and the fix is
self-contained.

Let's defer increasing FILESYSTEM_MAX_STACK_DEPTH for now.

Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for file-backed mounts")
Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0qxwcpec9s...@mail.gmail.com
Cc: Amir Goldstein <[email protected]>
Cc: Alexander Larsson <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---

Acked-by: Amir Goldstein <[email protected]>

But you forgot to include details of the stack usage analysis you ran
with the erofs+ovl^2 setup.

I am guessing people will want to see this information before relaxing
s_stack_depth in this case.

Sorry, I haven't checked email these days.  I'm not sure whether
posting detailed stack traces is useful; how about adding the
following words:

Didn't mean detailed stack traces, but you did some tests with the
new possible setup and you reached stack usage < 8K, so I think this is

The issue is that my limited stress test setup cannot cover
every case:

   - I cannot find a way to reliably trigger direct reclaim deep in
     the memory allocation path; is there any suggestion on this?

   - I'm not sure what the preferred way is to evaluate the worst-case
     stack usage below the block layer, but I guess we should care
     more about the delta added by one more overlayfs?

I can only say that the peak stack usage I've seen from my fsstress
run on an erofs+ovl^2 setup on x86_64 is < 8K (7184 bytes, though I
don't think the peak value alone is conclusive).  That run exercises
RW workloads in the upperdir, and for such workloads the stack depth
won't be impacted by FILESYSTEM_MAX_STACK_DEPTH, so I don't see how
such a workload could be harmful.
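
(For reference, peak figures like the one above can be collected with
the ftrace stack tracer; a minimal reader sketch, assuming
CONFIG_STACK_TRACER is enabled and debugfs is mounted at
/sys/kernel/debug:)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Dump a tracing file to stdout; silently skip it if unavailable. */
static void dump(const char *path)
{
	char buf[4096];
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return;
	printf("==> %s <==\n", path);
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);
	close(fd);
}

int main(void)
{
	int fd = open("/proc/sys/kernel/stack_tracer_enabled", O_WRONLY);

	/* enable the stack tracer (needs root) */
	if (fd >= 0) {
		if (write(fd, "1", 1) != 1)
			perror("write");
		close(fd);
	}
	/* worst-case stack size seen so far, and the trace behind it */
	dump("/sys/kernel/debug/tracing/stack_max_size");
	dump("/sys/kernel/debug/tracing/stack_trace");
	return 0;
}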

And then I manually copied up some files (because I didn't find any
available tool to stress overlayfs copy-ups; see the sketch after the
trace below for how one can be forced manually) and I could see the
delta (I think "ovl_copy_up_" is the only path that performs copy-ups):

    0)     6688      48   mempool_alloc_slab+0x9/0x20
    1)     6640      56   mempool_alloc_noprof+0x65/0xd0
    2)     6584      72   __sg_alloc_table+0x128/0x190
    3)     6512      40   sg_alloc_table_chained+0x46/0xa0
    4)     6472      64   scsi_alloc_sgtables+0x91/0x2c0
    5)     6408      72   sd_init_command+0x263/0x930
    6)     6336      88   scsi_queue_rq+0x54a/0xb70
    7)     6248     144   blk_mq_dispatch_rq_list+0x265/0x6c0
    8)     6104     144   __blk_mq_sched_dispatch_requests+0x399/0x5c0
    9)     5960      16   blk_mq_sched_dispatch_requests+0x2d/0x70
   10)     5944      56   blk_mq_run_hw_queue+0x208/0x290
   11)     5888      96   blk_mq_dispatch_list+0x13f/0x460
   12)     5792      48   blk_mq_flush_plug_list+0x4b/0x180
   13)     5744      32   blk_add_rq_to_plug+0x3d/0x160
   14)     5712     136   blk_mq_submit_bio+0x4f4/0x760
   15)     5576     120   __submit_bio+0x9b/0x240
   16)     5456      88   submit_bio_noacct_nocheck+0x271/0x330
   17)     5368      72   iomap_bio_read_folio_range+0xde/0x1d0
   18)     5296     112   iomap_read_folio_iter+0x1ee/0x2d0
   19)     5184     264   iomap_readahead+0xb9/0x290
   20)     4920      48   xfs_vm_readahead+0x4a/0x70
   21)     4872     112   read_pages+0x6c/0x1b0
   22)     4760     104   page_cache_ra_unbounded+0x12c/0x210
   23)     4656      80   filemap_readahead.isra.0+0x78/0xb0
   24)     4576     192   filemap_get_pages+0x3a6/0x820
   25)     4384     376   filemap_read+0xde/0x380
   26)     4008      32   xfs_file_buffered_read+0xa6/0xd0
   27)     3976      16   xfs_file_read_iter+0x6a/0xd0
   28)     3960      48   vfs_iocb_iter_read+0xdb/0x140
   29)     3912      88   erofs_fileio_rq_submit+0x136/0x190
   30)     3824     368   z_erofs_runqueue+0x1ce/0x9f0
   31)     3456     232   z_erofs_readahead+0x16c/0x220
   32)     3224     112   read_pages+0x6c/0x1b0
   33)     3112     104   page_cache_ra_unbounded+0x12c/0x210
   34)     3008      80   filemap_readahead.isra.0+0x78/0xb0
   35)     2928     192   filemap_get_pages+0x3a6/0x820
   36)     2736     400   filemap_splice_read+0x12c/0x2f0
   37)     2336      48   backing_file_splice_read+0x3f/0x90
   38)     2288     128   ovl_splice_read+0xef/0x170
   39)     2160     104   splice_direct_to_actor+0xb9/0x260
   40)     2056      88   do_splice_direct+0x76/0xc0
   41)     1968     120   ovl_copy_up_file+0x1a8/0x2b0
   42)     1848     840   ovl_copy_up_one+0x14b0/0x1610
   43)     1008      72   ovl_copy_up_flags+0xd7/0x110
   44)      936      56   ovl_open+0x72/0x110
   45)      880      56   do_dentry_open+0x16c/0x480
   46)      824      40   vfs_open+0x2e/0xf0
   47)      784     152   path_openat+0x80a/0x12e0
   48)      632     296   do_filp_open+0xb8/0x160
   49)      336      80   do_sys_openat2+0x72/0xf0
   50)      256      40   __x64_sys_openat+0x57/0xa0
   51)      216      40   do_syscall_64+0xa4/0x310
   52)      176     176   entry_SYSCALL_64_after_hwframe+0x77/0x7f

And it's still far from overflowing the 16k stacks, because the only
difference is how many (ovl_splice_read + backing_file_splice_read)
pairs are nested, and each layer only takes a few hundred bytes.
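
As mentioned above, since I didn't find a dedicated copy-up stress
tool, a copy-up can be forced from userspace simply by opening a
lower-layer file on the overlay for writing, which is exactly the
ovl_open -> ovl_copy_up_flags path at the bottom of the trace (a
hypothetical minimal helper, not an existing tool):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file-on-overlay>\n", argv[0]);
		return 1;
	}
	/* opening for write triggers the copy-up of a lower file */
	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	close(fd);
	return 0;
}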

Finally, I used my own rostress tool to stress RO workloads, and the
deepest stack so far is as below (5456 bytes):

    0)     5456      48   arch_scale_cpu_capacity+0x9/0x30
    1)     5408      16   cpu_util.constprop.0+0x7e/0xe0
    2)     5392     392   sched_balance_find_src_group+0x29f/0xd30
    3)     5000     280   sched_balance_rq+0x1b2/0xf10
    4)     4720     120   pick_next_task_fair+0x23b/0x7b0
    5)     4600     104   __schedule+0x2bc/0xda0
    6)     4496      16   schedule+0x27/0xd0
    7)     4480      24   io_schedule+0x46/0x70
    8)     4456     120   blk_mq_get_tag+0x11b/0x280
    9)     4336      96   __blk_mq_alloc_requests+0x2a1/0x410
   10)     4240     136   blk_mq_submit_bio+0x59c/0x760
   11)     4104     120   __submit_bio+0x9b/0x240
   12)     3984      88   submit_bio_noacct_nocheck+0x271/0x330
   13)     3896      72   iomap_bio_read_folio_range+0xde/0x1d0
   14)     3824     112   iomap_read_folio_iter+0x1ee/0x2d0
   15)     3712     264   iomap_readahead+0xb9/0x290
   16)     3448      48   xfs_vm_readahead+0x4a/0x70
   17)     3400     112   read_pages+0x6c/0x1b0
   18)     3288     104   page_cache_ra_unbounded+0x12c/0x210
   19)     3184      80   filemap_readahead.isra.0+0x78/0xb0
   20)     3104     192   filemap_get_pages+0x3a6/0x820
   21)     2912     376   filemap_read+0xde/0x380
   22)     2536      32   xfs_file_buffered_read+0xa6/0xd0
   23)     2504      16   xfs_file_read_iter+0x6a/0xd0
   24)     2488      48   vfs_iocb_iter_read+0xdb/0x140
   25)     2440      88   erofs_fileio_rq_submit+0x136/0x190
   26)     2352     368   z_erofs_runqueue+0x1ce/0x9f0
   27)     1984     232   z_erofs_readahead+0x16c/0x220
   28)     1752     112   read_pages+0x6c/0x1b0
   29)     1640     104   page_cache_ra_unbounded+0x12c/0x210
   30)     1536      40   force_page_cache_ra+0x96/0xc0
   31)     1496     192   filemap_get_pages+0x123/0x820
   32)     1304     376   filemap_read+0xde/0x380
   33)      928      72   do_iter_readv_writev+0x1b9/0x220
   34)      856      56   vfs_iter_read+0xde/0x140
   35)      800      64   backing_file_read_iter+0x193/0x1e0
   36)      736      56   ovl_read_iter+0x98/0xa0
   37)      680      72   do_iter_readv_writev+0x1b9/0x220
   38)      608      56   vfs_iter_read+0xde/0x140
   39)      552      64   backing_file_read_iter+0x193/0x1e0
   40)      488      56   ovl_read_iter+0x98/0xa0
   41)      432     152   vfs_read+0x21a/0x350
   42)      280      64   __x64_sys_pread64+0x92/0xc0
   43)      216      40   do_syscall_64+0xa4/0x310
   44)      176     176   entry_SYSCALL_64_after_hwframe+0x77/0x7f

something worth mentioning.


Note: There are some observations while evaluating the erofs + ovl^2
setup with an XFS backing fs:

    - Regular RW workloads traverse only one overlayfs layer regardless of
      the value of FILESYSTEM_MAX_STACK_DEPTH, because `upperdir=` cannot
      point to another overlayfs.  Therefore, for pure RW workloads, the
      typical stack is always just:
        overlayfs + upper fs + underlay storage

    - For read-only workloads and the copy-up read part (ovl_splice_read),
      the difference can lie in how many overlays are nested.
      The stack just looks like either:
        ovl + ovl [+ erofs] + backing fs + underlay storage
      or
        ovl [+ erofs] + ext4/xfs + underlay storage

    - The fs reclaim path should be entered only once, so the writeback
      path will not re-enter (see the sketch right after this list).
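
A simplified illustration of why that holds (this mirrors the mm
behavior as I understand it, not a quote of the actual code): direct
reclaim runs with PF_MEMALLOC set on the task, and allocations made
while it is set skip direct reclaim, so the reclaim/writeback path
cannot re-enter itself:

#include <linux/sched.h>

/* PF_MEMALLOC stays set for the whole duration of direct reclaim. */
static bool may_enter_fs_reclaim(void)
{
	return !(current->flags & PF_MEMALLOC);
}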

Sorry about my English, and I'm not sure whether this is enough (e.g.
the FUSE passthrough part).  I will wait for your further input (and
other acks) before sending this patch upstream.


I think that most people will have problems understanding this
rationale not because of the English, but because of the tech ;)
This is a bit too hand-wavy IMO.

Honestly, I don't have a better way to describe it; I think we'd
better just focus on the increment from one more overlayfs:


OK, but are we talking about one more overlayfs?
This patch is adding just one erofs, so what am I missing?

Sorry, I didn't describe it accurately: first, I tested erofs+ovl^2.

It's the last overlayfs mount that fails, and the stack traces start
from that last overlayfs.  So compared with the normal erofs+ovl setup
(which can already be mounted correctly upstream without this patch),
it's one more overlayfs.


FILESYSTEM_MAX_STACK_DEPTH 2 already works with 8k kstacks on 32-bit
arches, so I don't think raising FILESYSTEM_MAX_STACK_DEPTH from 2 to
3, which adds a few hundred bytes of stack usage from the intermediate
overlayfs on 16k kstacks on 64-bit arches, is harmful (and only RO
workloads and copy-ups are impacted).  From the RO trace above, the
repeating do_iter_readv_writev + vfs_iter_read +
backing_file_read_iter + ovl_read_iter unit costs 72 + 56 + 64 + 56 =
248 bytes per extra overlay layer.

And if a few hundred bytes of additional stack usage can overflow the
16k kstack, then I do think the kernel stack can be overflowed
randomly everywhere in the storage stack, not just because of this
FILESYSTEM_MAX_STACK_DEPTH modification.


Fine by me, but does that mean that you only want to allow
erofs backing files with >8K stack size?

With FILESYSTEM_MAX_STACK_DEPTH 2 and without this patch, erofs+ovl
can still mount successfully (so erofs+ovl is already guaranteed to
be fine).  Compared with that, with FILESYSTEM_MAX_STACK_DEPTH 3 the
extra stack is always just one more overlayfs, I think (either
erofs+ovl^2 or ovl^3)?
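
(To spell out the accounting: overlayfs roughly sets its own
s_stack_depth to one more than the deepest underlying layer and
rejects the mount once that exceeds FILESYSTEM_MAX_STACK_DEPTH, so
with the d53cd891f0e4 bump a file-backed erofs starts at depth 1 and
the second overlay reaches 3.  A simplified sketch, not the actual
overlayfs code:)

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/minmax.h>

static int ovl_set_stack_depth(struct super_block *sb,
			       struct super_block *upper,
			       struct super_block *lower)
{
	sb->s_stack_depth = max(upper->s_stack_depth,
				lower->s_stack_depth) + 1;
	if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH)
		return -EINVAL;	/* "maximum fs stacking depth exceeded" */
	return 0;
}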

Thanks,
Gao Xiang


Otherwise, I do not follow your argument.

Thanks,
Amir.

