Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

2023-09-15 Thread Christian Brauner
> Lifetime rules for fs-private parts of superblock are really private to

Fine, I'll drop that. It's still correct that a filesystem needs to take
care when it frees sb->s_fs_info. See the RCU fun you just encountered.




Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

2023-09-15 Thread Al Viro
On Fri, Sep 15, 2023 at 03:28:14PM +0100, Al Viro wrote:
> On Fri, Sep 15, 2023 at 04:12:07PM +0200, Christian Brauner wrote:
> > +   static void some_fs_kill_sb(struct super_block *sb)
> > +   {
> > +   struct some_fs_info *info = sb->s_fs_info;
> > +
> > +   kill_*_super(sb);
> > +   kfree(info);
> > +   }
> > +
> > +It's best practice to never deviate from this pattern.
> 
> The last part is flat-out incorrect.  If e.g. fatfs or cifs ever switches
> to that pattern, you'll get UAF - they need freeing of ->s_fs_info
> of anything that ever had been mounted done with RCU delay; moreover,
> unload_nls() in fatfs needs to be behind the same.
> 
> Lifetime rules for fs-private parts of superblock are really private to
> filesystem; their use by sget/sget_fc callbacks might impose restrictions
> on those, but that again is none of the VFS business.

PS: and no, we don't want to impose such RCU delay on every filesystem
out there; what's more, there's nothing to prohibit e.g. having ->s_fs_info
pointing to a refcounted fs-private object (possibly shared by various
superblocks), so freeing might very well be "drop the reference and destroy
if refcount has reached 0".


Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

2023-09-15 Thread Al Viro
On Fri, Sep 15, 2023 at 04:12:07PM +0200, Christian Brauner wrote:
> + static void some_fs_kill_sb(struct super_block *sb)
> + {
> + struct some_fs_info *info = sb->s_fs_info;
> +
> + kill_*_super(sb);
> + kfree(info);
> + }
> +
> +It's best practice to never deviate from this pattern.

The last part is flat-out incorrect.  If e.g. fatfs or cifs ever switches
to that pattern, you'll get UAF - they need freeing of ->s_fs_info
of anything that ever had been mounted done with RCU delay; moreover,
unload_nls() in fatfs needs to be behind the same.

Lifetime rules for fs-private parts of superblock are really private to
filesystem; their use by sget/sget_fc callbacks might impose restrictions
on those, but that again is none of the VFS business.


Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

2023-09-15 Thread Christian Brauner
> > tree of any filesystem (in-tree one or not) will have to go through the
> > changes and figure out WTF to do with their existing code.  We are
> > going to play whack-a-mole for at least several years as development
> > branches get rebased and merged.
> 
> Let me write something up.

So here I've written two porting.rst patches that aim to reflect the
current state of things (They do _not_ reflect what's in Christoph's
series here as that'ss again pretty separate and will require additional
spelling out.).

I'm adding explanation for both the old and new logic fwiw. I hope to
upstream these docs soon so we all have something to point to.

>From 200666901f53db74edf309d48e3c74fd275a822a Mon Sep 17 00:00:00 2001
From: Christian Brauner 
Date: Fri, 15 Sep 2023 16:01:02 +0200
Subject: [PATCH 1/2] porting: document new block device opening order

Signed-off-by: Christian Brauner 
---
 Documentation/filesystems/porting.rst | 24 
 1 file changed, 24 insertions(+)

diff --git a/Documentation/filesystems/porting.rst 
b/Documentation/filesystems/porting.rst
index deac4e973ddc..f436b64b77bf 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -949,3 +949,27 @@ mmap_lock held.  All in-tree users have been audited and 
do not seem to
 depend on the mmap_lock being held, but out of tree users should verify
 for themselves.  If they do need it, they can return VM_FAULT_RETRY to
 be called with the mmap_lock held.
+
+---
+
+**mandatory**
+
+The order of opening block devices and matching or creating superblocks has
+changed.
+
+The old logic opened block devices first and then tried to find a
+suitable superblock to reuse based on the block device pointer.
+
+The new logic finds or creates a superblock first, opening block devices
+afterwards. Since opening block devices cannot happen under s_umount because of
+lock ordering requirements s_umount is now dropped while opening block
+devices and reacquired before calling fill_super().
+
+In the old logic concurrent mounters would find the superblock on the list of
+active superblock for the filesystem type. Since the first opener of the block
+device would hold s_umount they would wait until the superblock became either
+born or died prematurely due to initialization failure.
+
+Since the new logic drops s_umount concurrent mounters could grab s_umount and
+would spin. Instead they are now made to wait using an explicit wait-wake
+mechanism without having to hold s_umount.
-- 
2.34.1

>From 1f09898322b4402219d8d3219d399c9e56a76bae Mon Sep 17 00:00:00 2001
From: Christian Brauner 
Date: Fri, 15 Sep 2023 16:01:40 +0200
Subject: [PATCH 2/2] porting: document superblock as block device holder

Signed-off-by: Christian Brauner 
---
 Documentation/filesystems/porting.rst | 79 +++
 1 file changed, 79 insertions(+)

diff --git a/Documentation/filesystems/porting.rst 
b/Documentation/filesystems/porting.rst
index f436b64b77bf..fefefaf289b4 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -973,3 +973,82 @@ born or died prematurely due to initialization failure.
 Since the new logic drops s_umount concurrent mounters could grab s_umount and
 would spin. Instead they are now made to wait using an explicit wait-wake
 mechanism without having to hold s_umount.
+
+---
+
+**mandatory**
+
+The holder of a block device is now the superblock.
+
+The holder of a block device used to be the file_system_type which wasn't
+particularly useful. It wasn't possible to go from block device to owning
+superblock without matching on the device pointer stored in the superblock.
+This mechanism would only work for a single device so the block layer couldn't
+find the owning superblock associated with additional devices.
+
+In the old mechanism reusing or creating a superblock for racing mount(2) and
+umount(2) relied on the file_system_type as the holder. This was severly
+underdocumented however:
+
+(1) If the concurrent mount(2) managed to grab an active reference before the
+umount(2) dropped the last active reference in deactivate_locked_super()
+the mounter would simply reuse the existing superblock.
+
+(2) If the mounter came after deactivate_locked_super() but before
+the superblock had been removed from the list of superblocks of the
+filesystem type the mounter would wait until the superblock was shutdown
+and allocated a new superblock.
+
+(3) If the mounter came after deactivate_locked_super() and after
+the superblock had been removed from the list of superblocks of the
+filesystem type the mounter would allocate a new superblock.
+
+Because the holder of the block device was the filesystem type any concurrent
+mounter could open the block device without risking seeing EBUSY because the
+block device was still in use.
+
+Making the superblock the owner of the block device changes this as the holder
+is now a unique 

Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

2023-09-15 Thread Christian Brauner
On Thu, Sep 14, 2023 at 05:58:05PM +0100, Al Viro wrote:
> On Thu, Sep 14, 2023 at 04:02:25PM +0200, Christian Brauner wrote:
> 
> > Yes, you're right that making the superblock and not the filesytem type
> > the bd_holder changes the logic and we are aware of that of course. And
> > it requires changes such as moving additional block device closing from
> > where some callers currently do it.
> 
> Details, please?

Filesystems like xfs and ext4 that closed additional block devices (For
example, the logdev= mount option for xfs.) in put_super() could go
through stuff like:

blkdev_put()
-> bdev->bd_disk->fops->release() == lo_release()
   -> __loop_clr_fd()
  -> disk_force_media_change()
 -> __invalidate_device()
-> get_super()

which wouldn't have been a problem before because get_super() matched on
sb->s_bdev which obviously doesn't work because a log device or rt
device or whatever isn't the main block device. So we couldn't have
deadlocked.

But the fact that it is called in that manner from that place in the
first place is wildly adventurous especially considering that there
isn't __a single comment__ in that code why that is safe. So good luck
figuring this all out.

Now that we don't have to do that s_bdev matching thing anymore because
we directly associate the superblock with the block device we can go
straight from block device to superblock. But now calling blkdev_put()
under put_super() which holds s_umount could deadlock. So it's moved to
kill_sb where it should've always been called. Even without the
potential deadlock in the new scheme that's cleaner and easier to
understand imho and it just works for any block device.

> Note that Christoph's series has mashed (2) and (3) together, resulting
> in UAF in a bunch of places.  And I'm dead serious about

Yes, that I did fix as far as I'm aware. If the rules would've been
written down where when something was freed we would've had an easier
time figuring this out though. But they weren't so we missed it.

> Documentation/filesystems/porting being the right place; any development

Yes, agreed. I'll write a document for Christoph's next version.

I know that what you're saying is roughly that we shouldn't make the
same mistake as were done before but the fact that the old lifetime
rules weren't documented in any meaningful way and now we get grumbled
at in turn makes me grumble a bit. :) But overall point duly taken.

> tree of any filesystem (in-tree one or not) will have to go through the
> changes and figure out WTF to do with their existing code.  We are
> going to play whack-a-mole for at least several years as development
> branches get rebased and merged.

Let me write something up.

> 
> Incidentally, I'm going to add a (belated by 10 years) chunk in porting.rst
> re making sure that anything in superblock that might be needed by methods
> called in RCU mode should *not* be freed without an RCU delay...  Should've
> done that back in 3.12 merge window when RCU'd vfsmounts went in; as it
> is, today we have several filesystems with exact same kind of breakage.
> hfsplus and affs breakage had been there in 3.13 (missed those two), exfat
> and ntfs3 - introduced later, by initial merges of filesystems in question.
> Missed on review...

Cool, thanks for adding that.


Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

2023-09-15 Thread Christian Brauner
On Thu, Sep 14, 2023 at 08:23:31PM +0100, Al Viro wrote:
> On Thu, Sep 14, 2023 at 05:58:05PM +0100, Al Viro wrote:
> 
> > Incidentally, I'm going to add a (belated by 10 years) chunk in porting.rst
> > re making sure that anything in superblock that might be needed by methods
> > called in RCU mode should *not* be freed without an RCU delay...  Should've
> > done that back in 3.12 merge window when RCU'd vfsmounts went in; as it
> > is, today we have several filesystems with exact same kind of breakage.
> > hfsplus and affs breakage had been there in 3.13 (missed those two), exfat
> > and ntfs3 - introduced later, by initial merges of filesystems in question.
> > Missed on review...
> > 
> > Hell knows - perhaps Documentation/filesystems/whack-a-mole might be a good
> > idea...

pitfalls.rst or common-bugs.rst

or something like that.

> 
> Actually, utf8 casefolding stuff also has the same problem, so ext4 and f2fs
> with casefolding are also affected ;-/




Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

2023-09-14 Thread Al Viro
On Thu, Sep 14, 2023 at 05:58:05PM +0100, Al Viro wrote:

> Incidentally, I'm going to add a (belated by 10 years) chunk in porting.rst
> re making sure that anything in superblock that might be needed by methods
> called in RCU mode should *not* be freed without an RCU delay...  Should've
> done that back in 3.12 merge window when RCU'd vfsmounts went in; as it
> is, today we have several filesystems with exact same kind of breakage.
> hfsplus and affs breakage had been there in 3.13 (missed those two), exfat
> and ntfs3 - introduced later, by initial merges of filesystems in question.
> Missed on review...
> 
> Hell knows - perhaps Documentation/filesystems/whack-a-mole might be a good
> idea...

Actually, utf8 casefolding stuff also has the same problem, so ext4 and f2fs
with casefolding are also affected ;-/


Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

2023-09-14 Thread Al Viro
On Thu, Sep 14, 2023 at 04:02:25PM +0200, Christian Brauner wrote:

> Yes, you're right that making the superblock and not the filesytem type
> the bd_holder changes the logic and we are aware of that of course. And
> it requires changes such as moving additional block device closing from
> where some callers currently do it.

Details, please?

> But the filesytem type is not a very useful holder itself and has other
> drawbacks. The obvious one being that it requires us to wade through all
> superblocks on the system trying to find the superblock associated with
> a given block device continously grabbing and dropping sb_lock and
> s_umount. None of that is very pleasant nor elegant and it is for sure
> not very easy to understand (Plus, it's broken for btrfs freezing and
> syncing via block level ioctls.).

"Constantly" is a bit of a stretch - IIRC, we grabbed sb_lock once, then
went through the list comparing ->s_bdev (without any extra locking),
then bumped ->s_count on the found superblock, dropped sb_lock,
grabbed ->s_umount on the sucker and verified it's still alive.

Repeated grabbing of any lock happened only on a race with fs shutdown;
normal case is one spin_lock, one spin_unlock, one down_read().

Oh, well...

> Using the superblock as holder makes this go away and is overall a lot
> more useful and intuitive and can be extended to filesystems with
> multiple devices (Of which we apparently are bound to get more.).
>
> So I think this change is worth the pain.
> 
> It's a fair point that these lifetime rules should be documented in
> Documentation/filesystems/. The old lifetime documentation is too sparse
> to be useful though.

What *are* these lifetime rules?  Seriously, you have 3 chunks of
fs-dependent actions at the moment:
* the things needed to get rid of internal references pinning
inodes/dentries + whatever else we need done before generic_shutdown_super()
* the stuff to be done between generic_shutdown_super() and
making the sucker invisible to sget()/sget_fc()
* the stuff that must be done after we are sure that sget
callbacks won't be looking at this instance.

Note that Christoph's series has mashed (2) and (3) together, resulting
in UAF in a bunch of places.  And I'm dead serious about
Documentation/filesystems/porting being the right place; any development
tree of any filesystem (in-tree one or not) will have to go through the
changes and figure out WTF to do with their existing code.  We are
going to play whack-a-mole for at least several years as development
branches get rebased and merged.

Incidentally, I'm going to add a (belated by 10 years) chunk in porting.rst
re making sure that anything in superblock that might be needed by methods
called in RCU mode should *not* be freed without an RCU delay...  Should've
done that back in 3.12 merge window when RCU'd vfsmounts went in; as it
is, today we have several filesystems with exact same kind of breakage.
hfsplus and affs breakage had been there in 3.13 (missed those two), exfat
and ntfs3 - introduced later, by initial merges of filesystems in question.
Missed on review...

Hell knows - perhaps Documentation/filesystems/whack-a-mole might be a good
idea...


Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

2023-09-14 Thread Christian Brauner
> Christoph, could you explain what the hell do we need that for?  It does
> create the race in question and AFAICS 2c18a63b760a (and followups trying
> to plug holes in it) had been nothing but headache.
> 
> Old logics: if mount attempt with a different fs type happens, -EBUSY
> is precisely corrent - we would've gotten just that if mount() came
> before umount().  If the type matches, we might
>   1) come before deactivate_locked_super() by umount(2).
> No problem, we succeed.
>   2) come after the beginning of shutdown, but before the
> removal from the list; fine, we'll wait for the sucker to be
> unlocked (which happens in the end of generic_shutdown_super()),
> notice it's dead and create a new superblock.  Since the only
> part left on the umount side is closing the device, we are
> just fine.
>   3) come after the removal from the list.  So we won't
> wait for the old superblock to be unlocked, other than that
> it's exactly the same as (2).  It doesn't matter whether we
> open the device before or after close by umount - same owner
> anyway, no -EBUSY.
> 
> Your "owner shall be the superblock" breaks that...
> 
> If you want to mess with _three_-way split of ->kill_sb(),
> please start with writing down the rules re what should
> go into each of those parts; such writeup should go into
> Documentation/filesystems/porting anyway, even if the
> split is a two-way one, BTW.

Hm, I think that characterization of Christoph's changes is a bit harsh.

Yes, you're right that making the superblock and not the filesytem type
the bd_holder changes the logic and we are aware of that of course. And
it requires changes such as moving additional block device closing from
where some callers currently do it.

But the filesytem type is not a very useful holder itself and has other
drawbacks. The obvious one being that it requires us to wade through all
superblocks on the system trying to find the superblock associated with
a given block device continously grabbing and dropping sb_lock and
s_umount. None of that is very pleasant nor elegant and it is for sure
not very easy to understand (Plus, it's broken for btrfs freezing and
syncing via block level ioctls.).

Using the superblock as holder makes this go away and is overall a lot
more useful and intuitive and can be extended to filesystems with
multiple devices (Of which we apparently are bound to get more.).

So I think this change is worth the pain.

It's a fair point that these lifetime rules should be documented in
Documentation/filesystems/. The old lifetime documentation is too sparse
to be useful though.


Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

2023-09-14 Thread Christian Brauner
> BTW, this part of commit message in 2c18a63b760a is rather confused:
> Recent rework moved block device closing out of sb->put_super() and into
> sb->kill_sb() to avoid deadlocks as s_umount is held in put_super() and
> blkdev_put() can end up taking s_umount again.
> 
> That was *NOT* what a recent rework had done.  Block device closing had never
> been inside ->put_super() - at no point since that (closing, that is) had been
> introduced back in 0.97 ;-)  ->put_super() predates it (0.95c+).

I think the commit message probably just isn't clear enough. The main
block device of a superblock isn't closed in sb->put_super(). That's
always been closed in kill_block_super() after generic_shutdown_super().

But afaict filesystem like ext4 and xfs may have additional block
devices open exclusively and closed them in sb->put_super():

xfs_fs_put_super()
-> xfs_close_devices()
   -> xfs_blkdev_put()
  -> blkdev_put()

ext4_put_super()
-> ext4_blkdev_remove()
   -> blkdev_put()


Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

2023-09-13 Thread Al Viro
On Thu, Sep 14, 2023 at 03:37:05AM +0100, Al Viro wrote:
> On Thu, Sep 14, 2023 at 12:27:12AM +0100, Al Viro wrote:
> > On Wed, Sep 13, 2023 at 08:09:57AM -0300, Christoph Hellwig wrote:
> > > Releasing an anon dev_t is a very common thing when freeing a
> > > super_block, as that's done for basically any not block based file
> > > system (modulo the odd mtd special case).  So instead of requiring
> > > a special ->kill_sb helper and a lot of boilerplate in more complicated
> > > file systems, just release the anon dev_t in deactivate_locked_super if
> > > the super_block was using one.
> > > 
> > > As the freeing is done after the main call to kill_super_notify, this
> > > removes the need for having two slightly different call sites for it.
> > 
> > Huh?  At this stage in your series freeing is still in ->kill_sb()
> > instances, after the calls of kill_anon_super() you've turned into
> > the calls of generic_shutdown_super().
> > 
> > You do split it off into a separate method later in the series, but
> > at this point you are reopening the same UAF that had been dealt with
> > in dc3216b14160 "super: ensure valid info".
> > 
> > Either move the introduction of ->free_sb() before that one, or
> > split it into lifting put_anon_bdev() (left here) and getting rid
> > of kill_anon_super() (after ->free_sb() introduction).
> 
> Actually, looking at the final stage in the series, you still have
> kill_super_notify() done *AFTER* ->free_sb() call.  So the problem
> persists until the very end...

It's worse - look at the rationale for 2c18a63b760a "super: wait until
we passed kill super".  Basically, "don't remove from the lists
until after block device closing".  IOW, we have

* stuff that needs to be done before generic_shutdown_super() (things
like pinned dentries on ramfs, etc.)
* generic_shutdown_super() itself (dentry/inode eviction, optionally
->put_super())
* stuff that needs to be done before eviction from the lists (block
device closing, since 2c18a63b760a)
* eviction from the lists
* stuff that needs to be done *after* eviction from the lists.

BTW, this part of commit message in 2c18a63b760a is rather confused:
Recent rework moved block device closing out of sb->put_super() and into
sb->kill_sb() to avoid deadlocks as s_umount is held in put_super() and
blkdev_put() can end up taking s_umount again.

That was *NOT* what a recent rework had done.  Block device closing had never
been inside ->put_super() - at no point since that (closing, that is) had been
introduced back in 0.97 ;-)  ->put_super() predates it (0.95c+).

The race is real, but the cause is not some kind of move of blkdev_put().
Your 2ea6f68932f7 "fs: use the super_block as holder when mounting file
systems" is where it actually came from.

Christoph, could you explain what the hell do we need that for?  It does
create the race in question and AFAICS 2c18a63b760a (and followups trying
to plug holes in it) had been nothing but headache.

Old logics: if mount attempt with a different fs type happens, -EBUSY
is precisely corrent - we would've gotten just that if mount() came
before umount().  If the type matches, we might
1) come before deactivate_locked_super() by umount(2).
No problem, we succeed.
2) come after the beginning of shutdown, but before the
removal from the list; fine, we'll wait for the sucker to be
unlocked (which happens in the end of generic_shutdown_super()),
notice it's dead and create a new superblock.  Since the only
part left on the umount side is closing the device, we are
just fine.
3) come after the removal from the list.  So we won't
wait for the old superblock to be unlocked, other than that
it's exactly the same as (2).  It doesn't matter whether we
open the device before or after close by umount - same owner
anyway, no -EBUSY.

Your "owner shall be the superblock" breaks that...

If you want to mess with _three_-way split of ->kill_sb(),
please start with writing down the rules re what should
go into each of those parts; such writeup should go into
Documentation/filesystems/porting anyway, even if the
split is a two-way one, BTW.


Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

2023-09-13 Thread Al Viro
On Thu, Sep 14, 2023 at 12:27:12AM +0100, Al Viro wrote:
> On Wed, Sep 13, 2023 at 08:09:57AM -0300, Christoph Hellwig wrote:
> > Releasing an anon dev_t is a very common thing when freeing a
> > super_block, as that's done for basically any not block based file
> > system (modulo the odd mtd special case).  So instead of requiring
> > a special ->kill_sb helper and a lot of boilerplate in more complicated
> > file systems, just release the anon dev_t in deactivate_locked_super if
> > the super_block was using one.
> > 
> > As the freeing is done after the main call to kill_super_notify, this
> > removes the need for having two slightly different call sites for it.
> 
> Huh?  At this stage in your series freeing is still in ->kill_sb()
> instances, after the calls of kill_anon_super() you've turned into
> the calls of generic_shutdown_super().
> 
> You do split it off into a separate method later in the series, but
> at this point you are reopening the same UAF that had been dealt with
> in dc3216b14160 "super: ensure valid info".
> 
> Either move the introduction of ->free_sb() before that one, or
> split it into lifting put_anon_bdev() (left here) and getting rid
> of kill_anon_super() (after ->free_sb() introduction).

Actually, looking at the final stage in the series, you still have
kill_super_notify() done *AFTER* ->free_sb() call.  So the problem
persists until the very end...


Re: [PATCH 03/19] fs: release anon dev_t in deactivate_locked_super

2023-09-13 Thread Al Viro
On Wed, Sep 13, 2023 at 08:09:57AM -0300, Christoph Hellwig wrote:
> Releasing an anon dev_t is a very common thing when freeing a
> super_block, as that's done for basically any not block based file
> system (modulo the odd mtd special case).  So instead of requiring
> a special ->kill_sb helper and a lot of boilerplate in more complicated
> file systems, just release the anon dev_t in deactivate_locked_super if
> the super_block was using one.
> 
> As the freeing is done after the main call to kill_super_notify, this
> removes the need for having two slightly different call sites for it.

Huh?  At this stage in your series freeing is still in ->kill_sb()
instances, after the calls of kill_anon_super() you've turned into
the calls of generic_shutdown_super().

You do split it off into a separate method later in the series, but
at this point you are reopening the same UAF that had been dealt with
in dc3216b14160 "super: ensure valid info".

Either move the introduction of ->free_sb() before that one, or
split it into lifting put_anon_bdev() (left here) and getting rid
of kill_anon_super() (after ->free_sb() introduction).