On 2026/1/4 18:01, Amir Goldstein wrote:
[+fsdevel][+overlayfs]

On Sun, Jan 4, 2026 at 4:56 AM Gao Xiang <[email protected]> wrote:

Hi Amir,

On 2026/1/1 23:52, Amir Goldstein wrote:
On Wed, Dec 31, 2025 at 9:42 PM Gao Xiang <[email protected]> wrote:

Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
stack overflow, but it breaks composefs mounts, which sometimes need
erofs+ovl^2 (and such setups have already been used in production for
quite a long time), since `s_stack_depth` can then reach 3 (i.e.,
FILESYSTEM_MAX_STACK_DEPTH would need to change from 2 to 3).
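
To spell out the depth accounting with that bump in place: a
file-backed EROFS mount gets s_stack_depth = 1, one overlayfs on top
of it gets 2, and a second overlayfs on top of that would need 3,
exceeding the current FILESYSTEM_MAX_STACK_DEPTH of 2, so the second
overlayfs mount fails.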

After a long discussion in GitHub issues [1] about possible solutions,
one conclusion is that there is no need to support nested file-backed
mounts (especially if FILESYSTEM_MAX_STACK_DEPTH is increased to 3).
So let's disallow them right now, since loopback devices can always be
used as a fallback.

Then, I started to wonder about an alternative EROFS quick fix to
address the composefs mounts directly for this cycle: since EROFS is
the only fs that supports file-backed mounts, and every other stacked
fs bumps `s_stack_depth`, just reject backing inodes whose filesystem
has `s_stack_depth` != 0 or is EROFS itself, instead.
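
Roughly, the idea is a check like the following in the file-backed
mount setup path (a sketch of the idea only; the exact placement and
the error code below are illustrative, not the final patch):

	/* `file` is the backing file passed to the EROFS mount */
	struct super_block *bsb = file_inode(file)->i_sb;

	/*
	 * Any other stacked fs (e.g. overlayfs) has a nonzero
	 * s_stack_depth; a file-backed EROFS sb keeps it at 0, hence
	 * the extra magic check to forbid nested file-backed mounts.
	 */
	if (bsb->s_stack_depth || bsb->s_magic == EROFS_SUPER_MAGIC)
		return -EOPNOTSUPP;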

At least it works for all known file-backed mount use cases (composefs,
containerd, and Android APEX for some Android vendors), and the fix is
self-contained.

Let's defer increasing FILESYSTEM_MAX_STACK_DEPTH for now.

Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for file-backed mounts")
Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0qxwcpec9s...@mail.gmail.com
Cc: Amir Goldstein <[email protected]>
Cc: Alexander Larsson <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
---

Acked-by: Amir Goldstein <[email protected]>

But you forgot to include details of the stack usage analysis you ran
with the erofs+ovl^2 setup.

I am guessing people will want to see this information before relaxing
s_stack_depth in this case.

Sorry, I didn't check email these days.  I'm not sure if posting
detailed stack traces is useful; how about adding the following
words:

Didn't mean detailed stack traces, but you did some tests with the
new possible setup and you reached stack usage < 8K, so I think this is
something worth mentioning.

The issue is that my limited stress test setup cannot cover
every case:

 - I cannot find a way to reliably trigger direct reclaim from a
   deep memory allocation; is there any suggestion on this?

 - I'm not sure what the preferred way is to evaluate the worst-case
   stack usage below the block layer, but I guess we should care more
   about the delta added by just one more overlayfs?

I can only say that the peak stack usage I've seen from my fsstress
run on an erofs+ovl^2 setup on x86_64 is < 8K (7184 bytes, though I
don't think the peak value by itself is all that meaningful).  It
exercises RW workloads in the upperdir, and for such workloads the
stack depth isn't impacted by FILESYSTEM_MAX_STACK_DEPTH, so I don't
see how such workloads could be harmful.

Then I manually copied up some files (because I didn't find any
available tool to stress overlayfs copy-ups), and the delta I could
see is as follows (I think "ovl_copy_up_*" is the only path that does
copy-ups):

  0)     6688      48   mempool_alloc_slab+0x9/0x20
  1)     6640      56   mempool_alloc_noprof+0x65/0xd0
  2)     6584      72   __sg_alloc_table+0x128/0x190
  3)     6512      40   sg_alloc_table_chained+0x46/0xa0
  4)     6472      64   scsi_alloc_sgtables+0x91/0x2c0
  5)     6408      72   sd_init_command+0x263/0x930
  6)     6336      88   scsi_queue_rq+0x54a/0xb70
  7)     6248     144   blk_mq_dispatch_rq_list+0x265/0x6c0
  8)     6104     144   __blk_mq_sched_dispatch_requests+0x399/0x5c0
  9)     5960      16   blk_mq_sched_dispatch_requests+0x2d/0x70
 10)     5944      56   blk_mq_run_hw_queue+0x208/0x290
 11)     5888      96   blk_mq_dispatch_list+0x13f/0x460
 12)     5792      48   blk_mq_flush_plug_list+0x4b/0x180
 13)     5744      32   blk_add_rq_to_plug+0x3d/0x160
 14)     5712     136   blk_mq_submit_bio+0x4f4/0x760
 15)     5576     120   __submit_bio+0x9b/0x240
 16)     5456      88   submit_bio_noacct_nocheck+0x271/0x330
 17)     5368      72   iomap_bio_read_folio_range+0xde/0x1d0
 18)     5296     112   iomap_read_folio_iter+0x1ee/0x2d0
 19)     5184     264   iomap_readahead+0xb9/0x290
 20)     4920      48   xfs_vm_readahead+0x4a/0x70
 21)     4872     112   read_pages+0x6c/0x1b0
 22)     4760     104   page_cache_ra_unbounded+0x12c/0x210
 23)     4656      80   filemap_readahead.isra.0+0x78/0xb0
 24)     4576     192   filemap_get_pages+0x3a6/0x820
 25)     4384     376   filemap_read+0xde/0x380
 26)     4008      32   xfs_file_buffered_read+0xa6/0xd0
 27)     3976      16   xfs_file_read_iter+0x6a/0xd0
 28)     3960      48   vfs_iocb_iter_read+0xdb/0x140
 29)     3912      88   erofs_fileio_rq_submit+0x136/0x190
 30)     3824     368   z_erofs_runqueue+0x1ce/0x9f0
 31)     3456     232   z_erofs_readahead+0x16c/0x220
 32)     3224     112   read_pages+0x6c/0x1b0
 33)     3112     104   page_cache_ra_unbounded+0x12c/0x210
 34)     3008      80   filemap_readahead.isra.0+0x78/0xb0
 35)     2928     192   filemap_get_pages+0x3a6/0x820
 36)     2736     400   filemap_splice_read+0x12c/0x2f0
 37)     2336      48   backing_file_splice_read+0x3f/0x90
 38)     2288     128   ovl_splice_read+0xef/0x170
 39)     2160     104   splice_direct_to_actor+0xb9/0x260
 40)     2056      88   do_splice_direct+0x76/0xc0
 41)     1968     120   ovl_copy_up_file+0x1a8/0x2b0
 42)     1848     840   ovl_copy_up_one+0x14b0/0x1610
 43)     1008      72   ovl_copy_up_flags+0xd7/0x110
 44)      936      56   ovl_open+0x72/0x110
 45)      880      56   do_dentry_open+0x16c/0x480
 46)      824      40   vfs_open+0x2e/0xf0
 47)      784     152   path_openat+0x80a/0x12e0
 48)      632     296   do_filp_open+0xb8/0x160
 49)      336      80   do_sys_openat2+0x72/0xf0
 50)      256      40   __x64_sys_openat+0x57/0xa0
 51)      216      40   do_syscall_64+0xa4/0x310
 52)      176     176   entry_SYSCALL_64_after_hwframe+0x77/0x7f

And it's still far from overflowing the 16k stack, because the only
difference between nesting levels seems to be how many
(ovl_splice_read + backing_file_splice_read) pairs are on the stack,
and each layer takes only a couple of hundred bytes.
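
To be concrete: in the trace above, one nested overlay level in the
splice path accounts for ovl_splice_read (128 bytes) +
backing_file_splice_read (48 bytes), i.e. roughly 176 bytes of extra
stack per additional layer.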

Finally, I used my own rostress tool to stress RO workloads, and the
deepest stack so far is as below (5456 bytes):

  0)     5456      48   arch_scale_cpu_capacity+0x9/0x30
  1)     5408      16   cpu_util.constprop.0+0x7e/0xe0
  2)     5392     392   sched_balance_find_src_group+0x29f/0xd30
  3)     5000     280   sched_balance_rq+0x1b2/0xf10
  4)     4720     120   pick_next_task_fair+0x23b/0x7b0
  5)     4600     104   __schedule+0x2bc/0xda0
  6)     4496      16   schedule+0x27/0xd0
  7)     4480      24   io_schedule+0x46/0x70
  8)     4456     120   blk_mq_get_tag+0x11b/0x280
  9)     4336      96   __blk_mq_alloc_requests+0x2a1/0x410
 10)     4240     136   blk_mq_submit_bio+0x59c/0x760
 11)     4104     120   __submit_bio+0x9b/0x240
 12)     3984      88   submit_bio_noacct_nocheck+0x271/0x330
 13)     3896      72   iomap_bio_read_folio_range+0xde/0x1d0
 14)     3824     112   iomap_read_folio_iter+0x1ee/0x2d0
 15)     3712     264   iomap_readahead+0xb9/0x290
 16)     3448      48   xfs_vm_readahead+0x4a/0x70
 17)     3400     112   read_pages+0x6c/0x1b0
 18)     3288     104   page_cache_ra_unbounded+0x12c/0x210
 19)     3184      80   filemap_readahead.isra.0+0x78/0xb0
 20)     3104     192   filemap_get_pages+0x3a6/0x820
 21)     2912     376   filemap_read+0xde/0x380
 22)     2536      32   xfs_file_buffered_read+0xa6/0xd0
 23)     2504      16   xfs_file_read_iter+0x6a/0xd0
 24)     2488      48   vfs_iocb_iter_read+0xdb/0x140
 25)     2440      88   erofs_fileio_rq_submit+0x136/0x190
 26)     2352     368   z_erofs_runqueue+0x1ce/0x9f0
 27)     1984     232   z_erofs_readahead+0x16c/0x220
 28)     1752     112   read_pages+0x6c/0x1b0
 29)     1640     104   page_cache_ra_unbounded+0x12c/0x210
 30)     1536      40   force_page_cache_ra+0x96/0xc0
 31)     1496     192   filemap_get_pages+0x123/0x820
 32)     1304     376   filemap_read+0xde/0x380
 33)      928      72   do_iter_readv_writev+0x1b9/0x220
 34)      856      56   vfs_iter_read+0xde/0x140
 35)      800      64   backing_file_read_iter+0x193/0x1e0
 36)      736      56   ovl_read_iter+0x98/0xa0
 37)      680      72   do_iter_readv_writev+0x1b9/0x220
 38)      608      56   vfs_iter_read+0xde/0x140
 39)      552      64   backing_file_read_iter+0x193/0x1e0
 40)      488      56   ovl_read_iter+0x98/0xa0
 41)      432     152   vfs_read+0x21a/0x350
 42)      280      64   __x64_sys_pread64+0x92/0xc0
 43)      216      40   do_syscall_64+0xa4/0x310
 44)      176     176   entry_SYSCALL_64_after_hwframe+0x77/0x7f
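
From frames 37-40 above, one extra overlay level in the read path
costs do_iter_readv_writev (72) + vfs_iter_read (56) +
backing_file_read_iter (64) + ovl_read_iter (56) = 248 bytes of stack.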



Note: There are some observations while evaluating the erofs + ovl^2
setup with an XFS backing fs:

   - Regular RW workloads traverse only one overlayfs layer regardless of
     the value of FILESYSTEM_MAX_STACK_DEPTH, because `upperdir=` cannot
     point to another overlayfs.  Therefore, for pure RW workloads, the
     typical stack is always just:
       overlayfs + upper fs + underlying storage

   - For read-only workloads and the copy-up read part (ovl_splice_read),
     the difference lies in how many overlays are nested.
     The stack just looks like either:
       ovl + ovl [+ erofs] + backing fs + underlying storage
     or
       ovl [+ erofs] + ext4/xfs + underlying storage

   - The fs reclaim path should be entered only once, so the writeback
     path will not re-enter.

Sorry about my English, and I'm not sure whether this is enough (e.g.
the FUSE passthrough part).  I will wait for your further input (and
other acks) before sending this patch upstream.


I think that most people will have problems understanding this
rationale not because of the English, but because of the tech ;)
This is a bit too hand-wavy IMO.

Honestly, I don't have a better way to describe it; I think we'd
better just focus on the increment from one more overlayfs:

FILESYSTEM_MAX_STACK_DEPTH == 2 already works with 8k kernel stacks on
32-bit arches, so I don't think raising FILESYSTEM_MAX_STACK_DEPTH
from 2 to 3, which only adds a few hundred bytes of stack usage for
the intermediate overlayfs on the 16k kernel stacks of 64-bit arches,
is harmful (and only RO workloads and copy-ups are impacted).
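
As rough numbers from the traces above: the extra layer costs ~176
bytes on the splice path or 248 bytes on the read path, i.e. well
under 2% of a 16k stack, and even the observed 7184-byte peak leaves
roughly 9K of headroom.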

And if a few hundred bytes of additional stack usage can overflow the
16k kernel stack, then I'd expect the kernel stack to overflow
randomly all over the storage stack anyway, not just because of this
FILESYSTEM_MAX_STACK_DEPTH modification.

Thanks,
Gao Xiang


(Also, BTW, I'm not sure whether it's possible to optimize read_iter
   and splice_read stack usage even further in overlayfs, e.g. by
   resolving the real file/path recursively in the top overlayfs and
   issuing the I/O on it directly, since the permission check is
   already done when opening the file; see the sketch below.)
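
A purely hypothetical sketch of that idea (ovl_file_operations exists
in fs/overlayfs/file.c, but the unwrapping helper below is made up):

	static struct file *ovl_bottom_real_file(struct file *realfile)
	{
		/* peel off nested overlay layers, return the bottom file */
		while (realfile->f_op == &ovl_file_operations)
			realfile = ovl_real_file_of(realfile); /* hypothetical */
		return realfile;
	}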

Maybe so, but the LSM permission-to-open hook is not the same hook
as the permission-to-read/write hook.

Thanks,
Amir.

