from:"Christian Brauner"

Re: [PATCH v2 0/4] inode_init_always zeroing i_state

2024-06-12 Thread Christian Brauner

On Tue, 11 Jun 2024 14:06:22 +0200, Mateusz Guzik wrote:
> As requested by Jan this is a 4-part series.
> 
> I diffed this against fs-next + my inode hash patch v3 as it adds one
> i_state = 0 case. Should that hash thing not be accepted this bit is
> trivially droppable from the patch.
> 
> Mateusz Guzik (4):
>   xfs: preserve i_state around inode_init_always in xfs_reinit_inode
>   vfs: partially sanitize i_state zeroing on inode creation
>   xfs: remove now spurious i_state initialization in xfs_inode_alloc
>   bcachefs: remove now spurious i_state initialization
> 
> [...]

Applied to the vfs.inode.rcu branch of the vfs/vfs.git tree.
Patches in the vfs.inode.rcu branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.inode.rcu

[1/4] xfs: preserve i_state around inode_init_always in xfs_reinit_inode
  https://git.kernel.org/vfs/vfs/c/f6f496712632
[2/4] vfs: partially sanitize i_state zeroing on inode creation
  https://git.kernel.org/vfs/vfs/c/1fddfb5628e4
[3/4] xfs: remove now spurious i_state initialization in xfs_inode_alloc
  https://git.kernel.org/vfs/vfs/c/c0a6bf1d02d8
[4/4] bcachefs: remove now spurious i_state initialization
  https://git.kernel.org/vfs/vfs/c/9ed6c60e6053

Re: [f2fs-dev] [PATCH v18 0/7] Case insensitive cleanup for ext4/f2fs

2024-06-07 Thread Christian Brauner

> Christian, can you please take a look? Eric has also been involved in

Thanks! Looks good to me. I've picked it up for testing in -next.


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

Re: [f2fs-dev] [PATCH v18 0/7] Case insensitive cleanup for ext4/f2fs

2024-06-07 Thread Christian Brauner

On Thu, 06 Jun 2024 10:33:46 +0300, Eugen Hristev wrote:
> I am trying to respin the series here :
> https://www.spinics.net/lists/linux-ext4/msg85081.html
> 
> I resent some of the v9 patches and got some reviews from Gabriel,
> I did changes as requested and here is v18.
> 
> Changes in v18:
> - in patch 2/7 removed the check for folded_name->len
> - in patch 4/7 simplified the use of generic_ci_match
> 
> [...]

Applied to the vfs.casefold branch of the vfs/vfs.git tree.
Patches in the vfs.casefold branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.casefold

[1/7] ext4: Simplify the handling of cached casefolded names
  https://git.kernel.org/vfs/vfs/c/f776f02a2c96
[2/7] f2fs: Simplify the handling of cached casefolded names
  https://git.kernel.org/vfs/vfs/c/632f4054b229
[3/7] libfs: Introduce case-insensitive string comparison helper
  https://git.kernel.org/vfs/vfs/c/6a79a4e187bd
[4/7] ext4: Reuse generic_ci_match for ci comparisons
  https://git.kernel.org/vfs/vfs/c/d76b92f61f3b
[5/7] f2fs: Reuse generic_ci_match for ci comparisons
  https://git.kernel.org/vfs/vfs/c/d66858eb0c72
[6/7] ext4: Move CONFIG_UNICODE defguards into the code flow
  https://git.kernel.org/vfs/vfs/c/d98c822232f8
[7/7] f2fs: Move CONFIG_UNICODE defguards into the code flow
  https://git.kernel.org/vfs/vfs/c/28add38d545f


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

Re: [PATCH v2] hostfs: convert hostfs to use the new mount API

2024-06-03 Thread Christian Brauner

On Thu, 30 May 2024 20:01:11 +0800, Hongbo Li wrote:
> Convert the hostfs filesystem to the new internal mount API as the old
> one will be obsoleted and removed.  This allows greater flexibility in
> communication of mount parameters between userspace, the VFS and the
> filesystem.
> 
> See Documentation/filesystems/mount_api.txt for more information.
> 
> [...]

Applied to the vfs.mount.api branch of the vfs/vfs.git tree.
Patches in the vfs.mount.api branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.mount.api

[1/1] hostfs: convert hostfs to use the new mount API
  https://git.kernel.org/vfs/vfs/c/cd140ce9f611

Re: [PATCH v3 00/12] cachefiles: some bugfixes and cleanups for ondemand requests

2024-05-29 Thread Christian Brauner

On Wed, 22 May 2024 19:42:56 +0800, libao...@huaweicloud.com wrote:
> From: Baokun Li 
> 
> Hi all!
> 
> This is the third version of this patch series. The new version has no
> functional changes compared to the previous one, so I've kept the previous
> Acked-by and Reviewed-by, so please let me know if you have any objections.
> 
> [...]

So I've taken that as a fixes series which should probably make it upstream
rather sooner than later. Correct?

---

Applied to the vfs.fixes branch of the vfs/vfs.git tree.
Patches in the vfs.fixes branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.fixes

[01/12] cachefiles: add output string to cachefiles_obj_[get|put]_ondemand_fd
https://git.kernel.org/vfs/vfs/c/cc5ac966f261
[02/12] cachefiles: remove requests from xarray during flushing requests
https://git.kernel.org/vfs/vfs/c/0fc75c5940fa
[03/12] cachefiles: fix slab-use-after-free in cachefiles_ondemand_get_fd()
https://git.kernel.org/vfs/vfs/c/de3e26f9e5b7
[04/12] cachefiles: fix slab-use-after-free in cachefiles_ondemand_daemon_read()
https://git.kernel.org/vfs/vfs/c/da4a82741606
[05/12] cachefiles: remove err_put_fd label in cachefiles_ondemand_daemon_read()
https://git.kernel.org/vfs/vfs/c/3e6d704f02aa
[06/12] cachefiles: add consistency check for copen/cread
https://git.kernel.org/vfs/vfs/c/a26dc49df37e
[07/12] cachefiles: add spin_lock for cachefiles_ondemand_info
https://git.kernel.org/vfs/vfs/c/0a790040838c
[08/12] cachefiles: never get a new anonymous fd if ondemand_id is valid
https://git.kernel.org/vfs/vfs/c/4988e35e95fc
[09/12] cachefiles: defer exposing anon_fd until after copy_to_user() succeeds
https://git.kernel.org/vfs/vfs/c/4b4391e77a6b
[10/12] cachefiles: Set object to close if ondemand_id < 0 in copen
https://git.kernel.org/vfs/vfs/c/4f8703fb3482
[11/12] cachefiles: flush all requests after setting CACHEFILES_DEAD
https://git.kernel.org/vfs/vfs/c/85e833cd7243
[12/12] cachefiles: make on-demand read killable
https://git.kernel.org/vfs/vfs/c/bc9dde615546

Re: [PATCH v2] fs: remove accidental overflow during wraparound check

2024-05-15 Thread Christian Brauner

On Mon, 13 May 2024 17:50:30 +, Justin Stitt wrote:
> Running syzkaller with the newly enabled signed integer overflow
> sanitizer produces this report:
> 
> [  195.401651] [ cut here ]
> [  195.404808] UBSAN: signed-integer-overflow in ../fs/open.c:321:15
> [  195.408739] 9223372036854775807 + 562984447377399 cannot be represented in 
> type 'loff_t' (aka 'long long')
> [  195.414683] CPU: 1 PID: 703 Comm: syz-executor.0 Not tainted 
> 6.8.0-rc2-00039-g14de58dbe653-dirty #11
> [  195.420138] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.16.3-debian-1.16.3-2 04/01/2014
> [  195.425804] Call Trace:
> [  195.427360]  
> [  195.428791]  dump_stack_lvl+0x93/0xd0
> [  195.431150]  handle_overflow+0x171/0x1b0
> [  195.433640]  vfs_fallocate+0x459/0x4f0
> ...
> [  195.490053] [ cut here ]
> [  195.493146] UBSAN: signed-integer-overflow in ../fs/open.c:321:61
> [  195.497030] 9223372036854775807 + 562984447377399 cannot be represented in 
> type 'loff_t' (aka 'long long)
> [  195.502940] CPU: 1 PID: 703 Comm: syz-executor.0 Not tainted 
> 6.8.0-rc2-00039-g14de58dbe653-dirty #11
> [  195.508395] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.16.3-debian-1.16.3-2 04/01/2014
> [  195.514075] Call Trace:
> [  195.515636]  
> [  195.517000]  dump_stack_lvl+0x93/0xd0
> [  195.519255]  handle_overflow+0x171/0x1b0
> [  195.521677]  vfs_fallocate+0x4cb/0x4f0
> [  195.524033]  __x64_sys_fallocate+0xb2/0xf0
> 
> [...]

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] fs: remove accidental overflow during wraparound check
  https://git.kernel.org/vfs/vfs/c/c01a23b6fbd1

Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-10 Thread Christian Brauner

> For the uapi issue you describe below my take would be that we should just
> try, and hope that everyone's been dutifully using O_CLOEXEC. But maybe
> I'm biased from the gpu world, where we've been hammering it in that
> "O_CLOEXEC or bust" mantra since well over a decade. Really the only valid

Oh, we're very much on the same page. All new file descriptor types that
I've added over the years are O_CLOEXEC by default. IOW, you need to
remove O_CLOEXEC explicitly (see pidfd as an example). And imho, any new
fd type that's added should just be O_CLOEXEC by default.

Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-10 Thread Christian Brauner

On Thu, May 09, 2024 at 08:48:20AM -0700, Linus Torvalds wrote:
> On Thu, 9 May 2024 at 04:39, Christian Brauner  wrote:
> >
> > Not worth it without someone explaining in detail why imho. First pass
> > should be to try and replace kcmp() in scenarios where it's obviously
> > not needed or overkill.
> 
> Ack.
> 
> > I've added a CLASS(fd_raw) in a preliminary patch since we'll need that
> > anyway which means that your comparison patch becomes even simpler imho.
> > I've also added a selftest patch:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.misc
> 
> LGTM.
> 
> Maybe worth adding an explicit test for "open same file, but two
> separate opens, F_DUPFD_QUERY returns 0? Just to clarify the "it's not
> testing the file on the filesystem for equality, but the file pointer
> itself".

Yep, good point. Added now.

Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-09 Thread Christian Brauner

On Wed, May 08, 2024 at 10:14:44AM -0700, Linus Torvalds wrote:
> On Wed, 8 May 2024 at 09:19, Linus Torvalds
>  wrote:
> >
> > So since we already have two versions of F_DUPFD (the other being
> > F_DUPFD_CLOEXEC) I decided that the best thing to do is to just extend
> > on that existing naming pattern, and called it F_DUPFD_QUERY instead.
> >
> > I'm not married to the name, so if somebody hates it, feel free to
> > argue otherwise.
> 
> Side note: with this patch, doing
> 
>ret = fcntl(fd1, F_DUPFD_QUERY, fd2);
> 
> will result in:
> 
>  -1 (EBADF): 'fd1' is not a valid file descriptor
>  -1 (EINVAL): old kernel that doesn't support F_DUPFD_QUERY
>  0: fd2 does not refer to the same file as fd1
>  1: fd2 is the same 'struct file' as fd1
> 
> and it might be worth noting a couple of things here:
> 
>  (a) fd2 being an invalid file descriptor does not cause EBADF, it
> just causes "does not match".
> 
>  (b) we *could* use more bits for more equality
> 
> IOW, it would possibly make sense to extend the 0/1 result to be
> 
> - bit #0: same file pointer
> - bit #1: same path
> - bit #2: same dentry
> - bit #3: same inode
> 
> which are all different levels of "sameness".

Not worth it without someone explaining in detail why imho. First pass
should be to try and replace kcmp() in scenarios where it's obviously
not needed or overkill.

I've added a CLASS(fd_raw) in a preliminary patch since we'll need that
anyway which means that your comparison patch becomes even simpler imho.
I've also added a selftest patch:

https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.misc

?

Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-08 Thread Christian Brauner

On Wed, May 08, 2024 at 10:32:08AM +0200, Daniel Vetter wrote:
> On Wed, May 08, 2024 at 07:55:08AM +0200, Christian König wrote:
> > Am 07.05.24 um 21:07 schrieb Linus Torvalds:
> > > On Tue, 7 May 2024 at 11:04, Daniel Vetter  wrote:
> > > > On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote:
> > > > 
> > > > > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
> > > > > too, if this is possibly a more common thing. and not just DRM wants
> > > > > it.
> > > > > 
> > > > > Would something like that work for you?
> > > > Yes.
> > > > 
> > > > Adding Simon and Pekka as two of the usual suspects for this kind of
> > > > stuff. Also example code (the int return value is just so that callers 
> > > > know
> > > > when kcmp isn't available, they all only care about equality):
> > > > 
> > > > https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239
> > > That example thing shows that we shouldn't make it a FISAME ioctl - we
> > > should make it a fcntl() instead, and it would just be a companion to
> > > F_DUPFD.
> > > 
> > > Doesn't that strike everybody as a *much* cleaner interface? I think
> > > F_ISDUP would work very naturally indeed with F_DUPFD.
> > > 
> > > Yes? No?
> > 
> > Sounds absolutely sane to me.
> 
> Yeah fcntl(fd1, F_ISDUP, fd2); sounds extremely reasonable to me too.
> 
> Aside, after some irc discussions I paged a few more of the relevant info
> back in, and at least for dma-buf we kinda sorted this out by going away
> from the singleton inode in this patch: ed63bb1d1f84 ("dma-buf: give each
> buffer a full-fledged inode")
> 
> It's uapi now so we can't ever undo that, but with hindsight just the
> F_ISDUP is really what we wanted. Because we have no need for that inode
> aside from the unique inode number that's only used to compare dma-buf fd
> for sameness, e.g.
> 
> https://gitlab.freedesktop.org/wlroots/wlroots/-/blob/master/render/vulkan/texture.c#L490
> 
> The one question I have is whether this could lead to some exploit tools,
> because at least the android conformance test suite verifies that kcmp
> isn't available to apps (which is where we need it, because even with all
> the binder-based isolation gpu userspace still all run in the application
> process due to performance reasons, any ipc at all is just too much).
> 
> Otoh if we just add this to drm fd as an ioctl somewhere, then it will
> also be available to every android app because they all do need the gpu
> for rendering. So going with the full generic fcntl is probably best.
> -Sima

fcntl() will call security_file_fcntl(). IIRC, Android uses selinux and
I'm pretty certain they'd disallow any fcntl() operations they deems
unsafe. So a kernel update for them would likely require allow-listing
the new fcntl(). Or if they do allow all new fnctl()s by default they'd
have to disallow it if they thought that's an issue but really I don't
even think there's any issue in that.

I think kcmp() is a different problem because you can use it to compare
objects from different tasks. The generic fcntl() wouldn't allow that.

Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-08 Thread Christian Brauner

On Mon, May 06, 2024 at 04:29:44PM +0200, Christian König wrote:
> Am 04.05.24 um 20:20 schrieb Linus Torvalds:
> > On Sat, 4 May 2024 at 08:32, Linus Torvalds
> >  wrote:
> > > Lookie here, the fundamental issue is that epoll can call '->poll()'
> > > on a file descriptor that is being closed concurrently.
> > Thinking some more about this, and replying to myself...
> > 
> > Actually, I wonder if we could *really* fix this by simply moving the
> > eventpoll_release() to where it really belongs.
> > 
> > If we did it in file_close_fd_locked(),  it would actually make a
> > *lot* more sense. Particularly since eventpoll actually uses this:
> > 
> >  struct epoll_filefd {
> >  struct file *file;
> >  int fd;
> >  } __packed;
> > 
> > ie it doesn't just use the 'struct file *', it uses the 'fd' itself
> > (for ep_find()).
> > 
> > (Strictly speaking, it should also have a pointer to the 'struct
> > files_struct' to make the 'int fd' be meaningful).
> 
> While I completely agree on this I unfortunately have to ruin the idea.
> 
> Before we had KCMP some people relied on the strange behavior of eventpoll
> to compare struct files when the fd is the same.
> 
> I just recently suggested that solution to somebody at AMD as a workaround
> when KCMP is disabled because of security hardening and I'm pretty sure I've
> seen it somewhere else as well.
> 
> So when we change that it would break (undocumented?) UAPI behavior.

I've worked on that a bit yesterday and I learned new things about epoll
and ran into some limitations.

Like, what happens if process P1 has a file descriptor registered in an
epoll instance and now P1 forks and creates P2. So every file that P1
maintains gets copied into a new file descriptor table for P2. And the
same file descriptors refer to the same files for both P1 and P2.

So there's two interesting cases here:

(1) P2 explicitly removes the file descriptor from the epoll instance
via epoll_ctl(EPOLL_CTL_DEL). That removal affects both P1 and P2
since the  pair is only registered once and it isn't
marked whether it belongs to P1 and P2 fdtable.

So effectively fork()ing with epoll creates a weird shared state
where removal of file descriptors that were registered before the
fork() affects both child and parent.

I found that surprising even though I've worked with epoll quite
extensively in low-level userspace.

(2) P2 doesn't close it's file descriptors. It just exits. Since removal
of the file descriptor from the epoll instance isn't done during
close() but during last fput() P1's epoll state remains unaffected
by P2's sloppy exit because P1 still holds references to all files
in its fdtable.

(Sidenote, if one ends up adding every more duped-fds into epoll
instance that one doesn't explicitly close and all of them refer to
the same file wouldn't one just be allocating new epitems that
are kept around for a really long time?)

So if the removal of the fd would now be done during close() or during
exit_files() when we call close_files() and since there's currently no
way of differentiating whether P1 or P2 own that fd it would mean that
(2) collapses into (1) and we'd always alter (1)'s epoll state. That
would be a UAPI break.

So say we record the fdtable to get ownership of that file descriptor so
P2 doesn't close anything in (2) that really belongs to P1 to fix that
problem.

But afaict, that would break another possible use-case. Namely, where P1
creates an epoll instance and registeres fds and then fork()s to create
P2. Now P1 can exit and P2 takes over the epoll loop of P1. This
wouldn't work anymore because P1 would deregister all fds it owns in
that epoll instance during exit. I didn't see an immediate nice way of
fixing that issue.

But note that taking over an epoll loop from the parent doesn't work
reliably for some file descriptors. Consider man signalfd(2):

   epoll(7) semantics
   If a process adds (via epoll_ctl(2)) a signalfd file descriptor to an 
epoll(7) instance,
   then epoll_wait(2) returns events only for signals sent to that process. 
 In particular,
   if  the process then uses fork(2) to create a child process, then the 
child will be able
   to read(2) signals that  are  sent  to  it  using  the  signalfd  file  
descriptor,  but
   epoll_wait(2)  will  not  indicate  that the signalfd file descriptor is 
ready.  In this
   scenario, a possible workaround is that after the fork(2), the child 
process  can  close
   the  signalfd  file descriptor that it inherited from the parent process 
and then create
   another signalfd file descriptor and add it to the epoll instance.   
Alternatively,  the
   parent and the child could delay creating their (separate) signalfd file 
descriptors and
   adding them to the epoll instance until after the call to fork(2).

So effectively P1 opens a signalfd and registers it in an epoll
instance. Then

Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-08 Thread Christian Brauner

On Tue, May 07, 2024 at 07:45:02PM +0200, Christian König wrote:
> Am 07.05.24 um 18:46 schrieb Linus Torvalds:
> > On Tue, 7 May 2024 at 04:03, Daniel Vetter  wrote:
> > > It's really annoying that on some distros/builds we don't have that, and
> > > for gpu driver stack reasons we _really_ need to know whether a fd is the
> > > same as another, due to some messy uniqueness requirements on buffer
> > > objects various drivers have.
> > It's sad that such a simple thing would require two other horrid
> > models (EPOLL or KCMP).
> > 
> > There'[s a reason that KCMP is a config option - *some* of that is
> > horrible code - but the "compare file descriptors for equality" is not
> > that reason.
> > 
> > Note that KCMP really is a broken mess. It's also a potential security
> > hole, even for the simple things, because of how it ends up comparing
> > kernel pointers (ie it doesn't just say "same file descriptor", it
> > gives an ordering of them, so you can use KCMP to sort things in
> > kernel space).
> > 
> > And yes, it orders them after obfuscating the pointer, but it's still
> > not something I would consider sane as a baseline interface. It was
> > designed for checkpoint-restore, it's the wrong thing to use for some
> > "are these file descriptors the same".
> > 
> > The same argument goes for using EPOLL for that. Disgusting hack.
> > 
> > Just what are the requirements for the GPU stack? Is one of the file
> > descriptors "trusted", IOW, you know what kind it is?
> > 
> > Because dammit, it's *so* easy to do. You could just add a core DRM
> > ioctl for it. Literally just
> > 
> >  struct fd f1 = fdget(fd1);
> >  struct fd f2 = fdget(fd2);
> >  int same;
> > 
> >  same = f1.file && f1.file == f2.file;
> >  fdput(fd1);
> >  fdput(fd2);
> >  return same;
> > 
> > where the only question is if you also woudl want to deal with O_PATH
> > fd's, in which case the "fdget()" would be "fdget_raw()".
> > 
> > Honestly, adding some DRM ioctl for this sounds hacky, but it sounds
> > less hacky than relying on EPOLL or KCMP.
> > 
> > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
> > too, if this is possibly a more common thing. and not just DRM wants
> > it.
> > 
> > Would something like that work for you?
> 
> Well the generic approach yes, the DRM specific one maybe. IIRC we need to
> be able to compare both DRM as well as DMA-buf file descriptors.
> 
> The basic problem userspace tries to solve is that drivers might get the
> same fd through two different code paths.
> 
> For example application using OpenGL/Vulkan for rendering and VA-API for
> video decoding/encoding at the same time.
> 
> Both APIs get a fd which identifies the device to use. It can be the same,
> but it doesn't have to.
> 
> If it's the same device driver connection (or in kernel speak underlying
> struct file) then you can optimize away importing and exporting of buffers
> for example.
> 
> Additional to that it makes cgroup accounting much easier because you don't
> count things twice because they are shared etc...

One thing to keep in mind is that a generic VFS level comparing function
will only catch the obvious case where you have dup() equivalency as
outlined above by Linus. That's what most people are interested in and
that could easily replace most kcmp() use-cases for comparing fds.

But, of course there's the case where you have two file descriptors
referring to two different files that reference the same underlying
object (usually stashed in file->private_data).

For most cases that problem can ofc be solved by comparing the
underlying inode. But that doesn't work for drivers using the generic
anonymous inode infrastructure because it uses the same inode for
everything or for cases where the same underlying object can even be
represented by different inodes.

So for such cases a driver specific ioctl() to compare two fds will
be needed in addition to the generic helper.

Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-08 Thread Christian Brauner

On Tue, May 07, 2024 at 12:07:10PM -0700, Linus Torvalds wrote:
> On Tue, 7 May 2024 at 11:04, Daniel Vetter  wrote:
> >
> > On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote:
> >
> > > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl
> > > too, if this is possibly a more common thing. and not just DRM wants
> > > it.
> > >
> > > Would something like that work for you?
> >
> > Yes.
> >
> > Adding Simon and Pekka as two of the usual suspects for this kind of
> > stuff. Also example code (the int return value is just so that callers know
> > when kcmp isn't available, they all only care about equality):
> >
> > https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239
> 
> That example thing shows that we shouldn't make it a FISAME ioctl - we
> should make it a fcntl() instead, and it would just be a companion to
> F_DUPFD.
> 
> Doesn't that strike everybody as a *much* cleaner interface? I think

+1
See
https://github.com/systemd/systemd/blob/a4f0e0da3573a10bc5404142be8799418760b1d1/src/basic/fd-util.c#L517
that's another heavy user of this kind of functionality.

Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-06 Thread Christian Brauner

On Mon, May 06, 2024 at 02:47:23PM +0200, Daniel Vetter wrote:
> On Sun, May 05, 2024 at 01:53:48PM -0700, Linus Torvalds wrote:
> > On Sun, 5 May 2024 at 13:30, Al Viro  wrote:
> > >
> > > 0.  special-cased ->f_count rule for ->poll() is a wart and it's
> > > better to get rid of it.
> > >
> > > 1.  fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
> > > git rm taken to it.  Short of that, by all means, let's grab reference
> > > in there around the call of vfs_poll() (see (0)).
> > 
> > Agreed on 0/1.
> > 
> > > 2.  having ->poll() instances grab extra references to file passed
> > > to them is not something that should be encouraged; there's a plenty
> > > of potential problems, and "caller has it pinned, so we are fine with
> > > grabbing extra refs" is nowhere near enough to eliminate those.
> > 
> > So it's not clear why you hate it so much, since those extra
> > references are totally normal in all the other VFS paths.
> > 
> > I mean, they are perhaps not the *common* case, but we have a lot of
> > random get_file() calls sprinkled around in various places when you
> > end up passing a file descriptor off to some asynchronous operation
> > thing.
> > 
> > Yeah, I think most of them tend to be special operations (eg the tty
> > TIOCCONS ioctl to redirect the console), but it's not like vfs_ioctl()
> > is *that* different from vfs_poll. Different operation, not somehow
> > "one is more special than the other".
> > 
> > cachefiles and backing-file does it for regular IO, and drop it at IO
> > completion - not that different from what dma-buf does. It's in
> > ->read_iter() rather than ->poll(), but again: different operations,
> > but not "one of them is somehow fundamentally different".
> > 
> > > 3.  dma-buf uses of get_file() are probably safe (epoll shite aside),
> > > but they do look fishy.  That has nothing to do with epoll.
> > 
> > Now, what dma-buf basically seems to do is to avoid ref-counting its
> > own fundamental data structure, and replaces that by refcounting the
> > 'struct file' that *points* to it instead.
> > 
> > And it is a bit odd, but it actually makes some amount of sense,
> > because then what it passes around is that file pointer (and it allows
> > passing it around from user space *as* that file).
> > 
> > And honestly, if you look at why it then needs to add its refcount to
> > it all, it actually makes sense.  dma-bufs have this notion of
> > "fences" that are basically completion points for the asynchronous
> > DMA. Doing a "poll()" operation will add a note to the fence to get
> > that wakeup when it's done.
> > 
> > And yes, logically it takes a ref to the "struct dma_buf", but because
> > of how the lifetime of the dma_buf is associated with the lifetime of
> > the 'struct file', that then turns into taking a ref on the file.
> > 
> > Unusual? Yes. But not illogical. Not obviously broken. Tying the
> > lifetime of the dma_buf to the lifetime of a file that is passed along
> > makes _sense_ for that use.
> > 
> > I'm sure dma-bufs could add another level of refcounting on the
> > 'struct dma_buf' itself, and not make it be 1:1 with the file, but
> > it's not clear to me what the advantage would really be, or why it
> > would be wrong to re-use a refcount that is already there.
> 
> So there is generally another refcount, because dma_buf is just the
> cross-driver interface to some kind of real underlying buffer object from
> the various graphics related subsystems we have.
> 
> And since it's a pure file based api thing that ceases to serve any
> function once the fd/file is gone we tied all the dma_buf refcounting to
> the refcount struct file already maintains. But the underlying buffer
> object can easily outlive the dma_buf, and over the lifetime of an
> underlying buffer object you might actually end up creating different
> dma_buf api wrappers for it (but at least in drm we guarantee there's at
> most one, hence why vmwgfx does the atomic_inc_unless_zero trick, which I
> don't particularly like and isn't really needed).
> 
> But we could add another refcount, it just means we have 3 of those then
> when only really 2 are needed.

Fwiw, the TTM thing described upthread and in the other thread really
tries hard to work around the dma_buf == file lifetime choice by hooking
into the dma-buf specific release function so it can access the dmabuf
and then the file. All that seems like a pretty error prone thing to me.
So a separate refcount for dma_buf wouldn't be the worst as that would
allow that TTM thing to benefit and remove that nasty hacking into your
generic dma_buf ops. But maybe I'm the only one who sees it that way and
I'm certainly not familiar enough with dma-buf.

Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-06 Thread Christian Brauner

On Mon, May 06, 2024 at 11:27:04AM +0200, Christian Brauner wrote:
> On Mon, May 06, 2024 at 10:45:35AM +0200, Christian Brauner wrote:
> > > The fact is, it's not dma-buf that is violating any rules. It's epoll.
> > 
> > I agree that epoll() not taking a reference on the file is at least
> > unexpected and contradicts the usual code patterns for the sake of
> > performance and that it very likely is the case that most callers of
> > f_op->poll() don't know this.
> > 
> > Note, I cleary wrote upthread that I'm ok to do it like you suggested
> > but raised two concerns a) there's currently only one instance of
> > prolonged @file lifetime in f_op->poll() afaict and b) that there's
> > possibly going to be some performance impact on epoll().
> > 
> > So it's at least worth discussing what's more important because epoll()
> > is very widely used and it's not that we haven't favored performance
> > before.
> > 
> > But you've already said that you aren't concerned with performance on
> > epoll() upthread. So afaict then there's really not a lot more to
> > discuss other than take the patch and see whether we get any complaints.
> 
> Two closing thoughts:
> 
> (1) I wonder if this won't cause userspace regressions for the semantics
> of epoll because dying files are now silently ignored whereas before
> they'd generated events.
> 
> (2) The other part is that this seems to me that epoll() will now
> temporarly pin filesystems opening up the possibility for spurious
> EBUSY errors.
> 
> If you register a file descriptor in an epoll instance and then
> close it and umount the filesystem but epoll managed to do an fget()
> on that fd before that close() call then epoll will pin that
> filesystem.
> 
> If the f_op->poll() method does something that can take a while
> (blocks on a shared mutex of that subsystem) that umount is very
> likely going to return EBUSY suddenly.
> 
> Afaict, before that this wouldn't have been an issue at all and is
> likely more serious than performance.
> 
> (One option would be to only do epi_fget() for stuff like
> dma-buf that's never unmounted. That'll cover nearly every
> driver out there. Only "real" filesystems would have to contend with
> @file count going to zero but honestly they also deal with dentry
> lookup under RCU which is way more adventurous than this.)
> 
> Maybe I'm barking up the wrong tree though.

Sorry, had to step out for an appointment.

Under the assumption that I'm not entirely off with this - and I really
could be ofc - then one possibility would be that we enable persistence
of files from f_op->poll() for SB_NOUSER filesystems.

That'll catch everything that's relying on anonymous inodes (drm and all
drivers) and init_pseudo() so everything that isn't actually unmountable
(pipefs, pidfs, sockfs, etc.).

So something like the _completely untested_ diff on top of your proposal
above:

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a3f0f868adc4..95968a462544 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1018,8 +1018,24 @@ static struct file *epi_fget(const struct epitem *epi)
 static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt,
 int depth)
 {
-   struct file *file = epi_fget(epi);
+   struct file *file = epi->ffd.file;
__poll_t res;
+   bool unrefd = false;
+
+   /*
+* Taking a reference for anything that isn't mountable is fine
+* because we don't have to worry about spurious EBUSY warnings
+* from umount().
+*
+* File count might go to zero in f_op->poll() for mountable
+* filesystems.
+*/
+   if (file->f_inode->i_sb->s_flags & SB_NOUSER) {
+   unrefd = true;
+   file = epi_fget(epi);
+   } else if (file_count(file) == 0) {
+   file = NULL;
+   }

/*
 * We could return EPOLLERR | EPOLLHUP or something,
@@ -1034,7 +1050,9 @@ static __poll_t ep_item_poll(const struct epitem *epi, 
poll_table *pt,
res = vfs_poll(file, pt);
else
res = __ep_eventpoll_poll(file, pt, depth);
-   fput(file);
+
+   if (unrefd)
+   fput(file);
return res & epi->event.events;
 }

Basically, my worry is that we end up with really annoying to debug
EBUSYs caused by epoll(). I'd really like to avoid that. But again, I
might be wrong and this isn't an issue.

Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-06 Thread Christian Brauner

On Mon, May 06, 2024 at 10:45:35AM +0200, Christian Brauner wrote:
> > The fact is, it's not dma-buf that is violating any rules. It's epoll.
> 
> I agree that epoll() not taking a reference on the file is at least
> unexpected and contradicts the usual code patterns for the sake of
> performance and that it very likely is the case that most callers of
> f_op->poll() don't know this.
> 
> Note, I cleary wrote upthread that I'm ok to do it like you suggested
> but raised two concerns a) there's currently only one instance of
> prolonged @file lifetime in f_op->poll() afaict and b) that there's
> possibly going to be some performance impact on epoll().
> 
> So it's at least worth discussing what's more important because epoll()
> is very widely used and it's not that we haven't favored performance
> before.
> 
> But you've already said that you aren't concerned with performance on
> epoll() upthread. So afaict then there's really not a lot more to
> discuss other than take the patch and see whether we get any complaints.

Two closing thoughts:

(1) I wonder if this won't cause userspace regressions for the semantics
of epoll because dying files are now silently ignored whereas before
they'd generated events.

(2) The other part is that this seems to me that epoll() will now
temporarly pin filesystems opening up the possibility for spurious
EBUSY errors.

If you register a file descriptor in an epoll instance and then
close it and umount the filesystem but epoll managed to do an fget()
on that fd before that close() call then epoll will pin that
filesystem.

If the f_op->poll() method does something that can take a while
(blocks on a shared mutex of that subsystem) that umount is very
likely going to return EBUSY suddenly.

Afaict, before that this wouldn't have been an issue at all and is
likely more serious than performance.

(One option would be to only do epi_fget() for stuff like
dma-buf that's never unmounted. That'll cover nearly every
driver out there. Only "real" filesystems would have to contend with
@file count going to zero but honestly they also deal with dentry
lookup under RCU which is way more adventurous than this.)

Maybe I'm barking up the wrong tree though.

Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-06 Thread Christian Brauner

> The fact is, it's not dma-buf that is violating any rules. It's epoll.

I agree that epoll() not taking a reference on the file is at least
unexpected and contradicts the usual code patterns for the sake of
performance and that it very likely is the case that most callers of
f_op->poll() don't know this.

Note, I cleary wrote upthread that I'm ok to do it like you suggested
but raised two concerns a) there's currently only one instance of
prolonged @file lifetime in f_op->poll() afaict and b) that there's
possibly going to be some performance impact on epoll().

So it's at least worth discussing what's more important because epoll()
is very widely used and it's not that we haven't favored performance
before.

But you've already said that you aren't concerned with performance on
epoll() upthread. So afaict then there's really not a lot more to
discuss other than take the patch and see whether we get any complaints.

Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-05 Thread Christian Brauner

On Sat, May 04, 2024 at 08:40:25AM -0700, Linus Torvalds wrote:
> On Sat, 4 May 2024 at 08:32, Linus Torvalds
>  wrote:
> >
> > Now, during this TOTALLY INNOCENT sock_poll(), in another thread, the
> > file closing completes, eventpoll_release() finishes [..]
> 
> Actually, Al is right that ep_item_poll() should be holding the
> ep->mtx, so eventpoll_release() -> eventpoll_release_file_file() ->
> mutex_lock(>mtx) should block and the file doesn't actually get
> released.

So I know you've seen it yourself but for my own peace of mind I've said
that in the other mail and in the other thread already that all callers
of ep_item_poll() do already hold the ep->mtx:

do_epoll_ctl()
-> epoll_mutex_lock(>mtx)
-> ep_insert()
   -> ep_item_poll()

do_epoll_ctl()
-> epoll_mutex_lock(>mtx)
-> ep_modify()
   -> ep_item_poll()

ep_send_events()
-> mutex_lock(>mtx)
-> ep_item_poll()

/* nested; and all callers of ep_item_poll() already hold ep->mtx */
__ep_eventpoll_poll()
-> mutex_lock_nested(>mtx, wait)
-> ep_item_poll()

So it's simply not possible to end up with a UAF in f_op->poll() because
eventpoll_release_file_file() serializes on ep->mtx as well:

__fput()
-> eventpoll_release()
   -> eventpoll_release_file()
  {
  // @file->f_count is zero _but file is not freed_
  // so taking file->f_lock is absolutely fine
  spin_lock(>f_lock);
  // mark as dying

  // serialzed on ep->mtx
  mutex_lock(>mtx);
  __ep_rmove(ep, epi);
  ...

  }
  -> mutex_lock(>mtx)

-> f_op->release()
-> kfree(file)

So afaict it's simply not possible to end up with a UAF in f_op->poll().

And I agree with you that for some instances it's valid to take another
reference to a file from f_op->poll() but then they need to use
get_file_active() imho and simply handle the case where f_count is zero.
And we need to document that in Documentation/filesystems/file.rst or
locking.rst.

But if it's simply just dma buf that cares about that long-term
reference then really we should just force them to take the reference
like I suggested but don't penalize everyone else. When I took a glance
at all f_op->poll() implementations I didn't spot one that did take
extra references.

But if you absolutely want to have epoll take the reference before it
calls into f_op->poll() that's fine with me as well. But we might end up
forcing epoll to do a lot of final fput()s which I'm not sure is all
that desirable.

Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-04 Thread Christian Brauner

On Sat, May 04, 2024 at 12:39:00AM +0100, Al Viro wrote:
> On Fri, May 03, 2024 at 04:16:15PM -0700, Linus Torvalds wrote:
> > On Fri, 3 May 2024 at 15:07, Al Viro  wrote:
> > >
> > > Suppose your program calls select() on a pipe and dmabuf, sees data to be 
> > > read
> > > from pipe, reads it, closes both pipe and dmabuf and exits.
> > >
> > > Would you expect that dmabuf file would stick around for hell knows how 
> > > long
> > > after that?  I would certainly be very surprised by running into that...
> > 
> > Why?
> > 
> > That's the _point_ of refcounts. They make the thing they refcount
> > stay around until it's no longer referenced.
> > 
> > Now, I agree that dmabuf's are a bit odd in how they use a 'struct
> > file' *as* their refcount, but hey, it's a specialty use. Unusual
> > perhaps, but not exactly wrong.
> > 
> > I suspect that if you saw a dmabuf just have its own 'refcount_t' and
> > stay around until it was done, you wouldn't bat an eye at it, and it's
> > really just the "it uses a struct file for counting" that you are
> > reacting to.
> 
> *IF* those files are on purely internal filesystem, that's probably
> OK; do that with something on something mountable (char device,
> sysfs file, etc.) and you have a problem with filesystem staying
> busy.

In this instance it is ok because dma-buf is an internal fs. I had the
exact same reaction you had initially but it doesn't matter for dma-buf
afaict as that thing can never be unmounted.

Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)

2024-05-04 Thread Christian Brauner

On Fri, May 03, 2024 at 12:59:52PM -0700, Kees Cook wrote:
> On Fri, May 03, 2024 at 01:35:09PM -0600, Jens Axboe wrote:
> > On 5/3/24 1:22 PM, Kees Cook wrote:
> > > On Fri, May 03, 2024 at 12:49:11PM -0600, Jens Axboe wrote:
> > >> On 5/3/24 12:26 PM, Kees Cook wrote:
> > >>> Thanks for doing this analysis! I suspect at least a start of a fix
> > >>> would be this:
> > >>>
> > >>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > >>> index 8fe5aa67b167..15e8f74ee0f2 100644
> > >>> --- a/drivers/dma-buf/dma-buf.c
> > >>> +++ b/drivers/dma-buf/dma-buf.c
> > >>> @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, 
> > >>> poll_table *poll)
> > >>>  
> > >>> if (events & EPOLLOUT) {
> > >>> /* Paired with fput in dma_buf_poll_cb */
> > >>> -   get_file(dmabuf->file);
> > >>> -
> > >>> -   if (!dma_buf_poll_add_cb(resv, true, dcb))
> > >>> +   if (!atomic_long_inc_not_zero(>file) &&
> > >>> +   !dma_buf_poll_add_cb(resv, true, dcb))
> > >>> /* No callback queued, wake up any 
> > >>> other waiters */
> > >>
> > >> Don't think this is sane at all. I'm assuming you meant:
> > >>
> > >>  atomic_long_inc_not_zero(>file->f_count);
> > > 
> > > Oops, yes, sorry. I was typed from memory instead of copy/paste.
> > 
> > Figured :-)
> > 
> > >> but won't fly as you're not under RCU in the first place. And what
> > >> protects it from being long gone before you attempt this anyway? This is
> > >> sane way to attempt to fix it, it's completely opposite of what sane ref
> > >> handling should look like.
> > >>
> > >> Not sure what the best fix is here, seems like dma-buf should hold an
> > >> actual reference to the file upfront rather than just stash a pointer
> > >> and then later _hope_ that it can just grab a reference. That seems
> > >> pretty horrible, and the real source of the issue.
> > > 
> > > AFAICT, epoll just doesn't hold any references at all. It depends,
> > > I think, on eventpoll_release() (really eventpoll_release_file())
> > > synchronizing with epoll_wait() (but I don't see how this happens, and
> > > the race seems to be against ep_item_poll() ...?)
> > >
> > > I'm really confused about how eventpoll manages the lifetime of polled
> > > fds.
> > 
> > epoll doesn't hold any references, and it's got some ugly callback to
> > deal with that. It's not ideal, nor pretty, but that's how it currently
> > works. See eventpoll_release() and how it's called. This means that
> > epoll itself is supposedly safe from the file going away, even though it
> > doesn't hold a reference to it.
> 
> Right -- what remains unclear to me is how struct file lifetime is
> expected to work in the struct file_operations::poll callbacks. Because
> using get_file() there looks clearly unsafe...

If you're in ->poll() you're holding the epoll mutex and
eventpoll_release_file() needs to acquire ep->mtx as well. So if you're
in ->poll() then you know that eventpoll_release_file() can't progress
and therefore eventpoll_release() can't make progress. So
f_op->release() won't be able to be called as it happens after
eventpoll_release() in __fput(). But f_count being able to go to zero is
expected.

Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)

2024-05-04 Thread Christian Brauner

On Fri, May 03, 2024 at 11:26:32AM -0700, Kees Cook wrote:
> On Fri, May 03, 2024 at 06:54:22PM +0700, Bui Quang Minh wrote:
> > [...]
> > Root cause:
> > AFAIK, eventpoll (epoll) does not increase the registered file's reference.
> > To ensure the safety, when the registered file is deallocated in __fput,
> > eventpoll_release is called to unregister the file from epoll. When calling
> > poll on epoll, epoll will loop through registered files and call vfs_poll on
> > these files. In the file's poll, file is guaranteed to be alive, however, as
> > epoll does not increase the registered file's reference, the file may be
> > dying
> > and it's not safe the get the file for later use. And dma_buf_poll violates
> > this. In the dma_buf_poll, it tries to get_file to use in the callback. This
> > leads to a race where the dmabuf->file can be fput twice.
> > 
> > Here is the race occurs in the above proof-of-concept
> > 
> > close(dmabuf->file)
> > __fput_sync (f_count == 1, last ref)
> > f_count-- (f_count == 0 now)
> > __fput
> > epoll_wait
> > vfs_poll(dmabuf->file)
> > get_file(dmabuf->file)(f_count == 1)
> > eventpoll_release
> > dmabuf->file deallocation
> > fput(dmabuf->file) (f_count == 1)
> > f_count--
> > dmabuf->file deallocation
> > 
> > I am not familiar with the dma_buf so I don't know the proper fix for the
> > issue. About the rule that don't get the file for later use in poll callback
> > of
> > file, I wonder if it is there when only select/poll exist or just after
> > epoll
> > appears.
> > 
> > I hope the analysis helps us to fix the issue.
> 
> Thanks for doing this analysis! I suspect at least a start of a fix
> would be this:
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 8fe5aa67b167..15e8f74ee0f2 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -267,9 +267,8 @@ static __poll_t dma_buf_poll(struct file *file, 
> poll_table *poll)
>  
>   if (events & EPOLLOUT) {
>   /* Paired with fput in dma_buf_poll_cb */
> - get_file(dmabuf->file);
> -
> - if (!dma_buf_poll_add_cb(resv, true, dcb))
> + if (!atomic_long_inc_not_zero(>file) &&
> + !dma_buf_poll_add_cb(resv, true, dcb))
>   /* No callback queued, wake up any other 
> waiters */
>   dma_buf_poll_cb(NULL, >cb);
>   else
> @@ -290,9 +289,8 @@ static __poll_t dma_buf_poll(struct file *file, 
> poll_table *poll)
>  
>   if (events & EPOLLIN) {
>   /* Paired with fput in dma_buf_poll_cb */
> - get_file(dmabuf->file);
> -
> - if (!dma_buf_poll_add_cb(resv, false, dcb))
> + if (!atomic_long_inc_not_zero(>file) &&
> + !dma_buf_poll_add_cb(resv, false, dcb))
>   /* No callback queued, wake up any other 
> waiters */
>   dma_buf_poll_cb(NULL, >cb);
>   else
> 
> 
> But this ends up leaving "active" non-zero, and at close time it runs
> into:
> 
> BUG_ON(dmabuf->cb_in.active || dmabuf->cb_out.active);
> 
> But the bottom line is that get_file() is unsafe to use in some places,
> one of which appears to be in the poll handler. There are maybe some
> other fragile places too, like in drivers/gpu/drm/vmwgfx/ttm_object.c:
> 
> static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
> {
>   return atomic_long_inc_not_zero(>file->f_count) != 0L;
> }
> 
> Which I also note involves a dmabuf...
> 
> Due to this issue I've proposed fixing get_file() to detect pathological 
> states:
> https://lore.kernel.org/lkml/2024050252.work.690-k...@kernel.org/
> 
> But that has run into some push-back. I'm hoping that seeing this epoll
> example will help illustrate what needs fixing a little better.
> 
> I think the best current proposal is to just WARN sooner instead of a
> full refcount_t implementation:
> 
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 8dfd53b52744..e09107d0a3d6 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1040,7 +1040,8 @@ struct file_handle {
>  
>  static inline struct file *get_file(struct file *f)
>  {
> - atomic_long_inc(>f_count);
> + long prior = atomic_long_fetch_inc_relaxed(>f_count);
> + WARN_ONCE(!prior, "struct file::f_count incremented from zero; 
> use-after-free condition present!\n");
>   return f;
>  }
>  
> 
> 
> What's the right way to deal with the dmabuf situation? (And I suspect
> it applies to get_dma_buf_unless_doomed() as well...)

No, it doesn't. That's safe afaict as I've

Re: [PATCH] fs: WARN when f_count resurrection is attempted

2024-05-04 Thread Christian Brauner

On Fri, 03 May 2024 13:16:25 -0700, Kees Cook wrote:
> It should never happen that get_file() is called on a file with
> f_count equal to zero. If this happens, a use-after-free condition
> has happened[1], and we need to attempt a best-effort reporting of
> the situation to help find the root cause more easily. Additionally,
> this serves as a data corruption indicator that system owners using
> warn_limit or panic_on_warn would like to have detected.
> 
> [...]

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] fs: WARN when f_count resurrection is attempted
  https://git.kernel.org/vfs/vfs/c/f6bdc7865ef4

Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-04 Thread Christian Brauner

On Fri, May 03, 2024 at 02:33:37PM -0700, Linus Torvalds wrote:
> On Fri, 3 May 2024 at 14:24, Al Viro  wrote:
> >
> > Can we get to ep_item_poll(epi, ...) after eventpoll_release_file()
> > got past __ep_remove()?  Because if we can, we have a worse problem -
> > epi freed under us.
> 
> Look at the hack in __ep_remove(): if it is concurrent with
> eventpoll_release_file(), it will hit this code
> 
> spin_lock(>f_lock);
> if (epi->dying && !force) {
> spin_unlock(>f_lock);
> return false;
> }
> 
> and not free the epi.
> 
> But as far as I can tell, almost nothing else cares about the f_lock
> and dying logic.
> 
> And in fact, I don't think doing
> 
> spin_lock(>f_lock);
> 
> is even valid in the places that look up file through "epi->ffd.file",
> because the lock itself is inside the thing that you can't trust until
> you've taken the lock...
> 
> So I agree with Kees about the use of "atomic_dec_not_zero()" kind of
> logic - but it also needs to be in an RCU-readlocked region, I think.

Why isn't it enough to just force dma_buf_poll() to use
get_file_active()? Then that whole problem goes away afaict.

So the fix I had yesterday before I had to step away from the computer
was literally just that [1]. It currently uses two atomic incs
potentially but that can probably be fixed by the dma folks to be
smarter about when they actually need to take a file reference.

> 
> I wish epoll() just took the damn file ref itself. But since it relies
> on the file refcount to release the data structure, that obviously
> can't work.
> 
> Linus

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..7149c45976e1 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -244,13 +244,18 @@ static __poll_t dma_buf_poll(struct file *file, 
poll_table *poll)
if (!dmabuf || !dmabuf->resv)
return EPOLLERR;

+   if (!get_file_active(>file))
+   return EPOLLERR;
+
resv = dmabuf->resv;

poll_wait(file, >poll, poll);

events = poll_requested_events(poll) & (EPOLLIN | EPOLLOUT);
-   if (!events)
+   if (!events) {
+   fput(file);
return 0;
+   }

dma_resv_lock(resv, NULL);

@@ -268,7 +273,6 @@ static __poll_t dma_buf_poll(struct file *file, poll_table 
*poll)
if (events & EPOLLOUT) {
/* Paired with fput in dma_buf_poll_cb */
get_file(dmabuf->file);
-
if (!dma_buf_poll_add_cb(resv, true, dcb))
/* No callback queued, wake up any other 
waiters */
dma_buf_poll_cb(NULL, >cb);
@@ -301,6 +305,7 @@ static __poll_t dma_buf_poll(struct file *file, poll_table 
*poll)
}

dma_resv_unlock(resv);
+   fput(file);
return events;
 }

Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-04 Thread Christian Brauner

On Fri, May 03, 2024 at 04:41:19PM -0700, Linus Torvalds wrote:
> On Fri, 3 May 2024 at 16:23, Kees Cook  wrote:
> >
> > static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
> > {
> > return atomic_long_inc_not_zero(>file->f_count) != 0L;
> > }
> >
> > If we end up adding epi_fget(), we'll have 2 cases of using
> > "atomic_long_inc_not_zero" for f_count. Do we need some kind of blessed
> > helper to live in file.h or something, with appropriate comments?
> 
> I wonder if we could try to abstract this out a bit more.
> 
> These games with non-ref-counted file structures *feel* a bit like the
> games we play with non-ref-counted (aka "stashed") 'struct dentry'
> that got fairly recently cleaned up with path_from_stashed() when both
> nsfs and pidfs started doing the same thing.
> 
> I'm not loving the TTM use of this thing, but at least the locking and
> logic feels a lot more straightforward (ie the
> atomic_long_inc_not_zero() here is clealy under the 'prime->mutex'
> lock

The TTM stuff is somewhat wild though and I've commented on that in
https://lore.kernel.org/r/20240503-mitmachen-redakteur-2707ab0cacc3@brauner
another thread that it can just use get_active_file().

Afaict, there's dma_buf_export() that allocates a new file and sets:

file->private_data = dmabuf;
dmabuf->file = file;
dentry->d_fsdata = dmabuf;

The file has f_op->release::dma_buf_file_release() as it's f_op->release
method. When that's called the file's refcount is already zero but the
file has not been freed yet. This will remove the dmabuf from some
public list but it won't free it.

dmabuf dentries have dma_buf_dentry_ops which use
dentry->d_release::dma_buf_release() to release the actual dmabuf
stashed in dentry->d_fsdata.

So that ends up with:

__fput()
-> f_op->release::dma_buf_file_release() // handles file specific freeing
-> dput()
   -> d_op->d_release::dma_buf_release() // handles dmabuf freeing
 // including the driver specific stuff.

If you fput() the file then the dmabuf will be freed as well immediately
after it when the dput() happens in __fput().

So that TTM thing does something else then in ttm_object_device_init().
It copies the dma_buf_ops into tdev->ops and replaces the dma_buf_ops
release method with it's own ttm_prime_dmabuf_release() and stashes the
old on in tdev->dma_buf_release.

And it uses that to hook into the release path so that @dmabuf will
still be valid for get_dma_buf_unless_doomed() under prime->mutex.

But again, get_dma_buf_unless_doomed() can just be replaced with
get_active_file() and then we're done with that part.

> IOW, the tty use looks correct to me, and it has fairly simple locking
> and is just catching the the race between 'fput()' decrementing the
> refcount and and 'file->f_op->release()' doing the actual release.
> 
> You are right that it's similar to the epoll thing in that sense, it
> just looks a _lot_ more straightforward to me (and, unlike epoll,
> doesn't look actively buggy right now).

It's not buggy afaict. It literally can just switch to get_active_file()
instead of open-coding it and we're done imho.

Re: [PATCH 5/5] fs: Convert struct file::f_count to refcount_long_t

2024-05-03 Thread Christian Brauner

On Fri, May 03, 2024 at 12:36:14PM +0200, Peter Zijlstra wrote:
> On Fri, May 03, 2024 at 11:37:25AM +0200, Christian Brauner wrote:
> > On Thu, May 02, 2024 at 05:41:23PM -0700, Kees Cook wrote:
> > > On Fri, May 03, 2024 at 01:14:45AM +0100, Al Viro wrote:
> > > > On Thu, May 02, 2024 at 05:10:18PM -0700, Kees Cook wrote:
> > > > 
> > > > > But anyway, there needs to be a general "oops I hit 0"-aware form of
> > > > > get_file(), and it seems like it should just be get_file() itself...
> > > > 
> > > > ... which brings back the question of what's the sane damage mitigation
> > > > for that.  Adding arseloads of never-exercised failure exits is 
> > > > generally
> > > > a bad idea - it's asking for bitrot and making the thing harder to 
> > > > review
> > > > in future.
> > > 
> > > Linus seems to prefer best-effort error recovery to sprinkling BUG()s
> > > around.  But if that's really the solution, then how about get_file()
> > > switching to to use inc_not_zero and BUG on 0?
> > 
> > Making get_file() return an error is not an option. For all current
> > callers that's pointless churn for a condition that's not supposed to
> > happen at all.
> > 
> > Additionally, iirc *_inc_not_zero() variants are implemented with
> > try_cmpxchg() which scales poorly under contention for a condition
> > that's not supposed to happen.
> 
>   unsigned long old = atomic_long_fetch_inc_relaxed(>f_count);
>   WARN_ON(!old);
> 
> Or somesuch might be an option?

Yeah, I'd be fine with that. WARN_ON() (or WARN_ON_ONCE() even?) and
then people can do their panic_on_warn stuff to get the BUG_ON()
behavior if they want to.

Re: [PATCH 5/5] fs: Convert struct file::f_count to refcount_long_t

2024-05-03 Thread Christian Brauner

On Fri, May 03, 2024 at 12:36:14PM +0200, Peter Zijlstra wrote:
> On Fri, May 03, 2024 at 11:37:25AM +0200, Christian Brauner wrote:
> > On Thu, May 02, 2024 at 05:41:23PM -0700, Kees Cook wrote:
> > > On Fri, May 03, 2024 at 01:14:45AM +0100, Al Viro wrote:
> > > > On Thu, May 02, 2024 at 05:10:18PM -0700, Kees Cook wrote:
> > > > 
> > > > > But anyway, there needs to be a general "oops I hit 0"-aware form of
> > > > > get_file(), and it seems like it should just be get_file() itself...
> > > > 
> > > > ... which brings back the question of what's the sane damage mitigation
> > > > for that.  Adding arseloads of never-exercised failure exits is 
> > > > generally
> > > > a bad idea - it's asking for bitrot and making the thing harder to 
> > > > review
> > > > in future.
> > > 
> > > Linus seems to prefer best-effort error recovery to sprinkling BUG()s
> > > around.  But if that's really the solution, then how about get_file()
> > > switching to to use inc_not_zero and BUG on 0?
> > 
> > Making get_file() return an error is not an option. For all current
> > callers that's pointless churn for a condition that's not supposed to
> > happen at all.
> > 
> > Additionally, iirc *_inc_not_zero() variants are implemented with
> > try_cmpxchg() which scales poorly under contention for a condition
> > that's not supposed to happen.
> 
>   unsigned long old = atomic_long_fetch_inc_relaxed(>f_count);
>   WARN_ON(!old);
> 
> Or somesuch might be an option?

Yeah, I'd be fine with that. WARN_ON() (or WARN_ON_ONCE() even?) and
then people can do their panic_on_warn stuff to get the BUG_ON()
behavior if they want to.

Re: [PATCH 5/5] fs: Convert struct file::f_count to refcount_long_t

2024-05-03 Thread Christian Brauner

On Thu, May 02, 2024 at 05:41:23PM -0700, Kees Cook wrote:
> On Fri, May 03, 2024 at 01:14:45AM +0100, Al Viro wrote:
> > On Thu, May 02, 2024 at 05:10:18PM -0700, Kees Cook wrote:
> > 
> > > But anyway, there needs to be a general "oops I hit 0"-aware form of
> > > get_file(), and it seems like it should just be get_file() itself...
> > 
> > ... which brings back the question of what's the sane damage mitigation
> > for that.  Adding arseloads of never-exercised failure exits is generally
> > a bad idea - it's asking for bitrot and making the thing harder to review
> > in future.
> 
> Linus seems to prefer best-effort error recovery to sprinkling BUG()s
> around.  But if that's really the solution, then how about get_file()
> switching to to use inc_not_zero and BUG on 0?

Making get_file() return an error is not an option. For all current
callers that's pointless churn for a condition that's not supposed to
happen at all.

Additionally, iirc *_inc_not_zero() variants are implemented with
try_cmpxchg() which scales poorly under contention for a condition
that's not supposed to happen.

Re: [PATCH 5/5] fs: Convert struct file::f_count to refcount_long_t

2024-05-03 Thread Christian Brauner

On Thu, May 02, 2024 at 05:41:23PM -0700, Kees Cook wrote:
> On Fri, May 03, 2024 at 01:14:45AM +0100, Al Viro wrote:
> > On Thu, May 02, 2024 at 05:10:18PM -0700, Kees Cook wrote:
> > 
> > > But anyway, there needs to be a general "oops I hit 0"-aware form of
> > > get_file(), and it seems like it should just be get_file() itself...
> > 
> > ... which brings back the question of what's the sane damage mitigation
> > for that.  Adding arseloads of never-exercised failure exits is generally
> > a bad idea - it's asking for bitrot and making the thing harder to review
> > in future.
> 
> Linus seems to prefer best-effort error recovery to sprinkling BUG()s
> around.  But if that's really the solution, then how about get_file()
> switching to to use inc_not_zero and BUG on 0?

Making get_file() return an error is not an option. For all current
callers that's pointless churn for a condition that's not supposed to
happen at all.

Additionally, iirc *_inc_not_zero() variants are implemented with
try_cmpxchg() which scales poorly under contention for a condition
that's not supposed to happen.

Re: [PATCH] selftests/binderfs: use the Makefile's rules, not Make's implicit rules

2024-05-03 Thread Christian Brauner

On Thu, May 02, 2024 at 06:58:20PM -0700, John Hubbard wrote:
> First of all, in order to build with clang at all, one must first apply
> Valentin Obst's build fix for LLVM [1]. Once that is done, then when
> building with clang, via:
> 
> make LLVM=1 -C tools/testing/selftests
> 
> ...the following error occurs:
> 
>clang: error: cannot specify -o when generating multiple output files
> 
> This is because clang, unlike gcc, won't accept invocations of this
> form:
> 
> clang file1.c header2.h
> 
> While trying to fix this, I noticed that:
> 
> a) selftests/lib.mk already avoids the problem, and
> 
> b) The binderfs Makefile indavertently bypasses the selftests/lib.mk
> build system, and quitely uses Make's implicit build rules for .c files
> instead.
> 
> The Makefile attempts to set up both a dependency and a source file,
> neither of which was needed, because lib.mk is able to automatically
> handle both. This line:
> 
> binderfs_test: binderfs_test.c
> 
> ...causes Make's implicit rules to run, which builds binderfs_test
> without ever looking at lib.mk.
> 
> Fix this by simply deleting the "binderfs_test:" Makefile target and
> letting lib.mk handle it instead.
> 
> [1] 
> https://lore.kernel.org/all/20240329-selftests-libmk-llvm-rfc-v1-1-2f9ed7d1c...@valentinobst.de/
> 
> Fixes: 6e29225af902 ("binderfs: port tests to test harness infrastructure")
> Cc: Christian Brauner 
> Signed-off-by: John Hubbard 
> ---

Reviewed-by: Christian Brauner

Re: [RFC PATCH v2 07/12] famfs prep: Add fs/super.c:kill_char_super()

2024-05-03 Thread Christian Brauner

On Thu, May 02, 2024 at 05:25:33PM -0500, John Groves wrote:
> On 24/05/02 07:17PM, Al Viro wrote:
> > On Mon, Apr 29, 2024 at 12:04:23PM -0500, John Groves wrote:
> > > Famfs needs a slightly different kill_super variant than already existed.
> > > Putting it local to famfs would require exporting d_genocide(); this
> > > seemed a bit cleaner.
> > 
> > What's wrong with kill_litter_super()?
> 
> I struggled with that, I don't have my head fully around the superblock
> handling code.

Fyi, see my other mail where I point out what's wrong and one way to fix it.

Re: [PATCH 1/5] fs: Do not allow get_file() to resurrect 0 f_count

2024-05-03 Thread Christian Brauner

On Thu, May 02, 2024 at 04:03:24PM -0700, Kees Cook wrote:
> On Fri, May 03, 2024 at 12:53:56AM +0200, Jann Horn wrote:
> > On Fri, May 3, 2024 at 12:34 AM Kees Cook  wrote:
> > > If f_count reaches 0, calling get_file() should be a failure. Adjust to
> > > use atomic_long_inc_not_zero() and return NULL on failure. In the future
> > > get_file() can be annotated with __must_check, though that is not
> > > currently possible.
> > [...]
> > >  static inline struct file *get_file(struct file *f)
> > >  {
> > > -   atomic_long_inc(>f_count);
> > > +   if (unlikely(!atomic_long_inc_not_zero(>f_count)))
> > > +   return NULL;
> > > return f;
> > >  }
> > 
> > Oh, I really don't like this...
> > 
> > In most code, if you call get_file() on a file and see refcount zero,
> > that basically means you're in a UAF write situation, or that you
> > could be in such a situation if you had raced differently. It's
> > basically just like refcount_inc() in that regard.
> 
> Shouldn't the system attempt to not make things worse if it encounters
> an inc-from-0 condition? Yes, we've already lost the race for a UaF
> condition, but maybe don't continue on.
> 
> > And get_file() has semantics just like refcount_inc(): The caller
> > guarantees that it is already holding a reference to the file; and if
> 
> Yes, but if that guarantee is violated, we should do something about it.
> 
> > the caller is wrong about that, their subsequent attempt to clean up
> > the reference that they think they were already holding will likely
> > lead to UAF too. If get_file() sees a zero refcount, there is no safe
> > way to continue. And all existing callers of get_file() expect the
> > return value to be the same as the non-NULL pointer they passed in, so
> > they'll either ignore the result of this check and barrel on, or oops
> > with a NULL deref.
> > 
> > For callers that want to actually try incrementing file refcounts that
> > could be zero, which is only possible under specific circumstances, we
> > have helpers like get_file_rcu() and get_file_active().
> 
> So what's going on in here:
> https://lore.kernel.org/linux-hardening/20240502223341.1835070-2-keesc...@chromium.org/

Afaict, there's dma_buf_export() that allocates a new file and sets:

file->private_data = dmabuf;
dmabuf->file = file;

The file has f_op->release::dma_buf_file_release() as it's f_op->release
method. When that's called the file's refcount is already zero but the
file has not been freed yet. This will remove the dmabuf from some
public list but it won't free it.

Then we see that any dentry allocated for such a dmabuf file will have
dma_buf_dentry_ops which in turn has
dentry->d_release::dma_buf_release() which is where the actual release
of the dma buffer happens taken from dentry->d_fsdata.

That whole thing calls allocate_file_pseudo() which allocates a new
dentry specific to that struct file. That dentry is unhashed (no lookup)
and thus isn't retained so when dput() is called and it's the last
reference it's immediately followed by
dentry->d_release::dma_buf_release() which wipes the dmabuf itself.

The lifetime of the dmabuf is managed via fget()/fput(). So the lifetime
of the dmabuf and the lifetime of the file are almost identical afaict:

__fput()
-> f_op->release::dma_buf_file_release() // handles file specific freeing
-> dput()
   -> d_op->d_release::dma_buf_release() // handles dmabuf freeing
 // including the driver specific stuff.

If you fput() the file then the dmabuf will be freed as well immediately
after it when the dput() happens in __fput() (I struggle to come up with
an explanation why the freeing of the dmabuf is moved to
dentry->d_release instead of f_op->release itself but that's a separate
matter.).

So on the face of it without looking a little closer

static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
{
return atomic_long_inc_not_zero(>file->f_count) != 0L;
}

looks wrong or broken. Because if dmabuf->file->f_count is 0 it implies
that @dmabuf should have already been freed. So the bug would be in
accessing @dmabuf. And if @dmabuf is valid then it automatically means
that dmabuf->file->f_count isn't 0. So it looks like it could just use
get_file().

_But_ the interesting bits are in ttm_object_device_init(). This steals
the dma_buf_ops into tdev->ops. It then takes dma_buf_ops->release and
stores it away into tdev->dma_buf_release. Then it overrides the
dma_buf_ops->release with ttm_prime_dmabuf_release(). And that's where
the very questionable magic happens.

So now let's say the dmabuf is freed because of lat fput(). We now get
f_op->release::dma_buf_file_release(). Then it's followed by dput() and
ultimately dentry->d_release::dma_buf_release() as mentioned above.

But now when we get:

dentry->d_release::dma_buf_release()
-> dmabuf->ops->release::ttm_prime_dmabuf_release()

instead of the original dmabuf->ops->release method that was

Re: [PATCH 1/5] fs: Do not allow get_file() to resurrect 0 f_count

2024-05-03 Thread Christian Brauner

On Thu, May 02, 2024 at 04:03:24PM -0700, Kees Cook wrote:
> On Fri, May 03, 2024 at 12:53:56AM +0200, Jann Horn wrote:
> > On Fri, May 3, 2024 at 12:34 AM Kees Cook  wrote:
> > > If f_count reaches 0, calling get_file() should be a failure. Adjust to
> > > use atomic_long_inc_not_zero() and return NULL on failure. In the future
> > > get_file() can be annotated with __must_check, though that is not
> > > currently possible.
> > [...]
> > >  static inline struct file *get_file(struct file *f)
> > >  {
> > > -   atomic_long_inc(>f_count);
> > > +   if (unlikely(!atomic_long_inc_not_zero(>f_count)))
> > > +   return NULL;
> > > return f;
> > >  }
> > 
> > Oh, I really don't like this...
> > 
> > In most code, if you call get_file() on a file and see refcount zero,
> > that basically means you're in a UAF write situation, or that you
> > could be in such a situation if you had raced differently. It's
> > basically just like refcount_inc() in that regard.
> 
> Shouldn't the system attempt to not make things worse if it encounters
> an inc-from-0 condition? Yes, we've already lost the race for a UaF
> condition, but maybe don't continue on.
> 
> > And get_file() has semantics just like refcount_inc(): The caller
> > guarantees that it is already holding a reference to the file; and if
> 
> Yes, but if that guarantee is violated, we should do something about it.
> 
> > the caller is wrong about that, their subsequent attempt to clean up
> > the reference that they think they were already holding will likely
> > lead to UAF too. If get_file() sees a zero refcount, there is no safe
> > way to continue. And all existing callers of get_file() expect the
> > return value to be the same as the non-NULL pointer they passed in, so
> > they'll either ignore the result of this check and barrel on, or oops
> > with a NULL deref.
> > 
> > For callers that want to actually try incrementing file refcounts that
> > could be zero, which is only possible under specific circumstances, we
> > have helpers like get_file_rcu() and get_file_active().
> 
> So what's going on in here:
> https://lore.kernel.org/linux-hardening/20240502223341.1835070-2-keesc...@chromium.org/

Afaict, there's dma_buf_export() that allocates a new file and sets:

file->private_data = dmabuf;
dmabuf->file = file;

The file has f_op->release::dma_buf_file_release() as it's f_op->release
method. When that's called the file's refcount is already zero but the
file has not been freed yet. This will remove the dmabuf from some
public list but it won't free it.

Then we see that any dentry allocated for such a dmabuf file will have
dma_buf_dentry_ops which in turn has
dentry->d_release::dma_buf_release() which is where the actual release
of the dma buffer happens taken from dentry->d_fsdata.

That whole thing calls allocate_file_pseudo() which allocates a new
dentry specific to that struct file. That dentry is unhashed (no lookup)
and thus isn't retained so when dput() is called and it's the last
reference it's immediately followed by
dentry->d_release::dma_buf_release() which wipes the dmabuf itself.

The lifetime of the dmabuf is managed via fget()/fput(). So the lifetime
of the dmabuf and the lifetime of the file are almost identical afaict:

__fput()
-> f_op->release::dma_buf_file_release() // handles file specific freeing
-> dput()
   -> d_op->d_release::dma_buf_release() // handles dmabuf freeing
 // including the driver specific stuff.

If you fput() the file then the dmabuf will be freed as well immediately
after it when the dput() happens in __fput() (I struggle to come up with
an explanation why the freeing of the dmabuf is moved to
dentry->d_release instead of f_op->release itself but that's a separate
matter.).

So on the face of it without looking a little closer

static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
{
return atomic_long_inc_not_zero(>file->f_count) != 0L;
}

looks wrong or broken. Because if dmabuf->file->f_count is 0 it implies
that @dmabuf should have already been freed. So the bug would be in
accessing @dmabuf. And if @dmabuf is valid then it automatically means
that dmabuf->file->f_count isn't 0. So it looks like it could just use
get_file().

_But_ the interesting bits are in ttm_object_device_init(). This steals
the dma_buf_ops into tdev->ops. It then takes dma_buf_ops->release and
stores it away into tdev->dma_buf_release. Then it overrides the
dma_buf_ops->release with ttm_prime_dmabuf_release(). And that's where
the very questionable magic happens.

So now let's say the dmabuf is freed because of lat fput(). We now get
f_op->release::dma_buf_file_release(). Then it's followed by dput() and
ultimately dentry->d_release::dma_buf_release() as mentioned above.

But now when we get:

dentry->d_release::dma_buf_release()
-> dmabuf->ops->release::ttm_prime_dmabuf_release()

instead of the original dmabuf->ops->release method that was

Re: [PATCH v3 7/9] selftests/pidfd: Fix wrong expectation

2024-05-02 Thread Christian Brauner

On Mon, Apr 29, 2024 at 09:19:09PM +0200, Mickaël Salaün wrote:
> Replace a wrong EXPECT_GT(self->child_pid_exited, 0) with EXPECT_GE(),
> which will be actually tested on the parent and child sides with a
> following commit.
> 
> Cc: Christian Brauner 
> Cc: Shuah Khan 
> Reviewed-by: Kees Cook 
> Signed-off-by: Mickaël Salaün 
> Link: https://lore.kernel.org/r/20240429191911.2552580-8-...@digikod.net
> ---

Looks good to me,
Reviewed-by: Christian Brauner

Re: [PATCH v3 1/9] selftests/pidfd: Fix config for pidfd_setns_test

2024-05-02 Thread Christian Brauner

On Mon, Apr 29, 2024 at 09:19:03PM +0200, Mickaël Salaün wrote:
> Required by switch_timens() to open /proc/self/ns/time_for_children.
> 
> CONFIG_GENERIC_VDSO_TIME_NS is not available on UML, so pidfd_setns_test
> cannot be run successfully on this architecture.
> 
> Cc: Christian Brauner 
> Cc: Shuah Khan 
> Fixes: 2b40c5db73e2 ("selftests/pidfd: add pidfd setns tests")
> Reviewed-by: Kees Cook 
> Signed-off-by: Mickaël Salaün 
> Link: https://lore.kernel.org/r/20240429191911.2552580-2-...@digikod.net
> ---

Looks good to me,
Reviewed-by: Christian Brauner

Re: [RFC PATCH v2 08/12] famfs: module operations & fs_context

2024-04-30 Thread Christian Brauner

On Mon, Apr 29, 2024 at 12:04:24PM -0500, John Groves wrote:
> Start building up from the famfs module operations. This commit
> includes the following:
> 
> * Register as a file system
> * Parse mount parameters
> * Allocate or find (and initialize) a superblock via famfs_get_tree()
> * Lookup the host dax device, and bail if it's in use (or not dax)
> * Register as the holder of the dax device if it's available
> * Add Kconfig and Makefile misc to build famfs
> * Add FAMFS_SUPER_MAGIC to include/uapi/linux/magic.h
> * Add export of fs/namei.c:may_open_dev(), which famfs needs to call
> * Update MAINTAINERS file for the fs/famfs/ path
> 
> The following exports had to happen to enable famfs:
> 
> * This uses the new fs/super.c:kill_char_super() - the other kill*super
>   helpers were not quite right.
> * This uses the dev_dax_iomap export of dax_dev_get()
> 
> This commit builds but is otherwise too incomplete to run
> 
> Signed-off-by: John Groves 
> ---
>  MAINTAINERS|   1 +
>  fs/Kconfig |   2 +
>  fs/Makefile|   1 +
>  fs/famfs/Kconfig   |  10 ++
>  fs/famfs/Makefile  |   5 +
>  fs/famfs/famfs_inode.c | 345 +
>  fs/famfs/famfs_internal.h  |  36 
>  fs/namei.c |   1 +
>  include/uapi/linux/magic.h |   1 +
>  9 files changed, 402 insertions(+)
>  create mode 100644 fs/famfs/Kconfig
>  create mode 100644 fs/famfs/Makefile
>  create mode 100644 fs/famfs/famfs_inode.c
>  create mode 100644 fs/famfs/famfs_internal.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 3f2d847dcf01..365d678e2f40 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8188,6 +8188,7 @@ L:  linux-...@vger.kernel.org
>  L:   linux-fsde...@vger.kernel.org
>  S:   Supported
>  F:   Documentation/filesystems/famfs.rst
> +F:   fs/famfs
>  
>  FANOTIFY
>  M:   Jan Kara 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index a46b0cbc4d8f..53b4629e92a0 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -140,6 +140,8 @@ source "fs/autofs/Kconfig"
>  source "fs/fuse/Kconfig"
>  source "fs/overlayfs/Kconfig"
>  
> +source "fs/famfs/Kconfig"
> +
>  menu "Caches"
>  
>  source "fs/netfs/Kconfig"
> diff --git a/fs/Makefile b/fs/Makefile
> index 6ecc9b0a53f2..3393f399a9e9 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -129,3 +129,4 @@ obj-$(CONFIG_EFIVAR_FS)   += efivarfs/
>  obj-$(CONFIG_EROFS_FS)   += erofs/
>  obj-$(CONFIG_VBOXSF_FS)  += vboxsf/
>  obj-$(CONFIG_ZONEFS_FS)  += zonefs/
> +obj-$(CONFIG_FAMFS) += famfs/
> diff --git a/fs/famfs/Kconfig b/fs/famfs/Kconfig
> new file mode 100644
> index ..edb8980820f7
> --- /dev/null
> +++ b/fs/famfs/Kconfig
> @@ -0,0 +1,10 @@
> +
> +
> +config FAMFS
> +   tristate "famfs: shared memory file system"
> +   depends on DEV_DAX && FS_DAX && DEV_DAX_IOMAP
> +   help
> +   Support for the famfs file system. Famfs is a dax file system that
> +   can support scale-out shared access to fabric-attached memory
> +   (e.g. CXL shared memory). Famfs is not a general purpose file system;
> +   it is an enabler for data sets in shared memory.
> diff --git a/fs/famfs/Makefile b/fs/famfs/Makefile
> new file mode 100644
> index ..62230bcd6793
> --- /dev/null
> +++ b/fs/famfs/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +obj-$(CONFIG_FAMFS) += famfs.o
> +
> +famfs-y := famfs_inode.o
> diff --git a/fs/famfs/famfs_inode.c b/fs/famfs/famfs_inode.c
> new file mode 100644
> index ..61306240fc0b
> --- /dev/null
> +++ b/fs/famfs/famfs_inode.c
> @@ -0,0 +1,345 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * famfs - dax file system for shared fabric-attached memory
> + *
> + * Copyright 2023-2024 Micron Technology, inc
> + *
> + * This file system, originally based on ramfs the dax support from xfs,
> + * is intended to allow multiple host systems to mount a common file system
> + * view of dax files that map to shared memory.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "famfs_internal.h"
> +
> +#define FAMFS_DEFAULT_MODE   0755
> +
> +static struct inode *famfs_get_inode(struct super_block *sb,
> +  const struct inode *dir,
> +  umode_t mode, dev_t dev)
> +{
> + struct inode *inode = new_inode(sb);
> + struct timespec64 tv;
> +
> + if (!inode)
> + return NULL;
> +
> + inode->i_ino = get_next_ino();
> + inode_init_owner(_mnt_idmap, inode, dir, mode);
> + inode->i_mapping->a_ops = _aops;
> + mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> + mapping_set_unevictable(inode->i_mapping);
> + tv = inode_set_ctime_current(inode);
> + inode_set_mtime_to_ts(inode, tv);
> +

Re: [RFC PATCH v2 0/2] ima: Fix detection of read/write violations on stacked filesystems

2024-04-26 Thread Christian Brauner

On Tue, Apr 23, 2024 at 05:30:52PM +0300, Amir Goldstein wrote:
> On Tue, Apr 23, 2024 at 4:21 PM Roberto Sassu
>  wrote:
> >
> > On Tue, 2024-04-23 at 09:02 +0300, Amir Goldstein wrote:
> > > On Mon, Apr 22, 2024 at 6:07 PM Stefan Berger  
> > > wrote:
> > > >
> > > > This series fixes the detection of read/write violations on stacked
> > > > filesystems. To be able to access the relevant dentries necessary to
> > > > detect files opened for writing on a stacked filesystem a new 
> > > > d_real_type
> > > > D_REAL_FILEDATA is introduced that allows callers to access all relevant
> > > > files involved in a stacked filesystem while traversing the layers.
> > > >
> > >
> > > Stefan,
> > >
> > > Both Miklos and myself objected to this solution:
> > > https://lore.kernel.org/linux-unionfs/cajfpeguctireyecoigcasjwpgpcx2nyfmz8h8ghgw-0uykf...@mail.gmail.com/
> > >
> > > Not sure what you are hoping to achieve from re-posting the same solution.
> > >
> > > I stopped counting how many times I already argued that *all* IMA/EVM
> > > assertions,
> > > including rw-ro violations should be enforced only on the real inode.
> >
> > I have hopefully a better idea. We should detect violations at each
> > level of the stack independently. And IMA should be invoked each time
> > overlayfs uses an underlying layer.
> >
> > That is currently not easy, from the IMA policy perspective, because
> > there are filesystem-specific rules, such as fsname= or fsuuid=. At the
> > moment, I'm not planning to solve this, but I'm thinking to use for
> > example FMODE_BACKING to ignore the filesystem-specific keywords and
> > match the rule anyway.
> >
> > For now, I'm only addressing the call to underlying layers. To make
> > sure that IMA evaluates every layer, I added a rule that checks the
> > inode UID:
> >
> > measure fowner=2000 mask=MAY_READ
> >
> >
> > I just investigated a bit, and I made some changes (for now, I'm just
> > making it work, and you tell me what you think).
> 
> I did not examine this up close, but this seems like a change in the right
> direction.
> Will need Christian's approval that this does not break any assumptions
> made on backing files.

In principle I don't care if IMA wants to call yet another security hook
in the backing file layer. I suspect it will impact performace if IMA is
enabled. So that's something to keep in mind. But it's certainly better
than blatantly abusing the dcache to achieve this.

> > diff --git a/fs/backing-file.c b/fs/backing-file.c
> > index 740185198db3..8016f62cf770 100644
> > --- a/fs/backing-file.c
> > +++ b/fs/backing-file.c
> > @@ -12,6 +12,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include "internal.h"
> >
> > @@ -40,12 +41,16 @@ struct file *backing_file_open(const struct path
> > *user_path, int flags,
> > if (IS_ERR(f))
> > return f;
> >
> > +   f->f_mode |= OPEN_FMODE(flags);
> > +
> > path_get(user_path);
> > *backing_file_user_path(f) = *user_path;
> > error = vfs_open(real_path, f);
> > if (error) {
> > fput(f);
> > f = ERR_PTR(error);
> > +   } else {
> > +   security_file_post_open(f, ACC_MODE(flags));
> > }
> >
> > return f;
> >
> >
> > Setup:
> >
> > # mount -t overlay -olowerdir=a,upperdir=b,workdir=c overlay d
> >
> > open is a tool with the following syntax:
> >
> > open  
> >
> > It performs the open, and waits for user input before closing the file.
> >
> >
> >
> > ToMToU (Time of Measurement - Time of Use):
> >
> > Same fs (overlayfs)
> >
> > # /root/open /root/test-dir/d/test-file r (terminal 1)
> > # /root/open /root/test-dir/d/test-file w (terminal 2)
> >
> > This works:
> >
> > 10 35435d0858d895b90097306171a2e5fcc7f5da9e ima-ng 
> > sha256:0e4acf326a82c6bded9d86f48d272d7a036b6490081bb6466ecc2a0e416b244a 
> > boot_aggregate
> > 10 cef529d5d1032ffb6d3e2154664c83ba18cf2576 ima-ng 
> > sha256:f2ca1bb6c7e907d06dafe4687e579fce76b37e4e93b7605022da52e6ccc26fd2 
> > test-file
> > 10 694277487b9753db78446192231b59b7be7c03ad ima-ng 
> > sha256:f2ca1bb6c7e907d06dafe4687e579fce76b37e4e93b7605022da52e6ccc26fd2 
> > /root/test-dir/d/test-file
> > 10  ima-ng 
> > sha256: 
> > test-file
> > 10  ima-ng 
> > sha256: 
> > /root/test-dir/d/test-file
> >
> > This is the result of calling IMA at both layers, and the violation of
> > course happens twice.
> >
> > This is also confirmed in the logs:
> >
> > Apr 23 14:52:45 fedora audit[994]: INTEGRITY_PCR pid=994 uid=0 auid=0 ses=3 
> > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 op=invalid_pcr 
> > cause=ToMToU comm="open" name="test-file" dev="sda3" ino=995512 res=1 
> > errno=0
> > Apr 23 14:52:45 fedora audit[994]: INTEGRITY_PCR pid=994 uid=0 auid=0

Re: [PATCH -next] erofs: get rid of erofs_fs_context

2024-04-19 Thread Christian Brauner

On Fri, Apr 19, 2024 at 11:14:59AM +0800, Baokun Li wrote:
> Instead of allocating the erofs_sb_info in get_tree() allocate it during
> init_fs_context() and after this erofs_fs_context is no longer needed,
> so replace ctx with sbi, no functional changes.
> 
> Suggested-by: Jingbo Xu 
> Signed-off-by: Baokun Li 
> ---
> Hi Gao Xiang,
> Hi Jingbo,
> 
> I noticed that Christian's original patch has been merged into the next
> branch, so I'm sending this patch out separately.

An an accident on my part as I left it in the vfs.fixes branch. I expect
that the erofs tree will pick it up.

> 
> Regards,
> Baokun
> 
>  fs/erofs/internal.h |   7 ---
>  fs/erofs/super.c| 112 ++--
>  2 files changed, 46 insertions(+), 73 deletions(-)
> 
> diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
> index 116c1d5d1932..53ebba952a2f 100644
> --- a/fs/erofs/internal.h
> +++ b/fs/erofs/internal.h
> @@ -84,13 +84,6 @@ struct erofs_dev_context {
>   bool flatdev;
>  };
>  
> -struct erofs_fs_context {
> - struct erofs_mount_opts opt;
> - struct erofs_dev_context *devs;
> - char *fsid;
> - char *domain_id;
> -};
> -
>  /* all filesystem-wide lz4 configurations */
>  struct erofs_sb_lz4_info {
>   /* # of pages needed for EROFS lz4 rolling decompression */
> diff --git a/fs/erofs/super.c b/fs/erofs/super.c
> index 4fc34b984e51..7172271290b9 100644
> --- a/fs/erofs/super.c
> +++ b/fs/erofs/super.c
> @@ -370,18 +370,18 @@ static int erofs_read_superblock(struct super_block *sb)
>   return ret;
>  }
>  
> -static void erofs_default_options(struct erofs_fs_context *ctx)
> +static void erofs_default_options(struct erofs_sb_info *sbi)
>  {
>  #ifdef CONFIG_EROFS_FS_ZIP
> - ctx->opt.cache_strategy = EROFS_ZIP_CACHE_READAROUND;
> - ctx->opt.max_sync_decompress_pages = 3;
> - ctx->opt.sync_decompress = EROFS_SYNC_DECOMPRESS_AUTO;
> + sbi->opt.cache_strategy = EROFS_ZIP_CACHE_READAROUND;
> + sbi->opt.max_sync_decompress_pages = 3;
> + sbi->opt.sync_decompress = EROFS_SYNC_DECOMPRESS_AUTO;
>  #endif
>  #ifdef CONFIG_EROFS_FS_XATTR
> - set_opt(>opt, XATTR_USER);
> + set_opt(>opt, XATTR_USER);
>  #endif
>  #ifdef CONFIG_EROFS_FS_POSIX_ACL
> - set_opt(>opt, POSIX_ACL);
> + set_opt(>opt, POSIX_ACL);
>  #endif
>  }
>  
> @@ -426,16 +426,16 @@ static const struct fs_parameter_spec 
> erofs_fs_parameters[] = {
>  static bool erofs_fc_set_dax_mode(struct fs_context *fc, unsigned int mode)
>  {
>  #ifdef CONFIG_FS_DAX
> - struct erofs_fs_context *ctx = fc->fs_private;
> + struct erofs_sb_info *sbi = fc->s_fs_info;
>  
>   switch (mode) {
>   case EROFS_MOUNT_DAX_ALWAYS:
> - set_opt(>opt, DAX_ALWAYS);
> - clear_opt(>opt, DAX_NEVER);
> + set_opt(>opt, DAX_ALWAYS);
> + clear_opt(>opt, DAX_NEVER);
>   return true;
>   case EROFS_MOUNT_DAX_NEVER:
> - set_opt(>opt, DAX_NEVER);
> - clear_opt(>opt, DAX_ALWAYS);
> + set_opt(>opt, DAX_NEVER);
> + clear_opt(>opt, DAX_ALWAYS);
>   return true;
>   default:
>   DBG_BUGON(1);
> @@ -450,7 +450,7 @@ static bool erofs_fc_set_dax_mode(struct fs_context *fc, 
> unsigned int mode)
>  static int erofs_fc_parse_param(struct fs_context *fc,
>   struct fs_parameter *param)
>  {
> - struct erofs_fs_context *ctx = fc->fs_private;
> + struct erofs_sb_info *sbi = fc->s_fs_info;
>   struct fs_parse_result result;
>   struct erofs_device_info *dif;
>   int opt, ret;
> @@ -463,9 +463,9 @@ static int erofs_fc_parse_param(struct fs_context *fc,
>   case Opt_user_xattr:
>  #ifdef CONFIG_EROFS_FS_XATTR
>   if (result.boolean)
> - set_opt(>opt, XATTR_USER);
> + set_opt(>opt, XATTR_USER);
>   else
> - clear_opt(>opt, XATTR_USER);
> + clear_opt(>opt, XATTR_USER);
>  #else
>   errorfc(fc, "{,no}user_xattr options not supported");
>  #endif
> @@ -473,16 +473,16 @@ static int erofs_fc_parse_param(struct fs_context *fc,
>   case Opt_acl:
>  #ifdef CONFIG_EROFS_FS_POSIX_ACL
>   if (result.boolean)
> - set_opt(>opt, POSIX_ACL);
> + set_opt(>opt, POSIX_ACL);
>   else
> - clear_opt(>opt, POSIX_ACL);
> + clear_opt(>opt, POSIX_ACL);
>  #else
>   errorfc(fc, "{,no}acl options not supported");
>  #endif
>   break;
>   case Opt_cache_strategy:
>  #ifdef CONFIG_EROFS_FS_ZIP
> - ctx->opt.cache_strategy = result.uint_32;
> + sbi->opt.cache_strategy = result.uint_32;
>  #else
>   errorfc(fc, "compression not supported, cache_strategy 
> ignored");
>  #endif
> @@ -504,27 +504,27 @@ static int erofs_fc_parse_param(struct fs_context *fc,
>

Re: tools/testing/selftests/clone3/clone3_set_tid.c appears to always pass?

2024-04-18 Thread Christian Brauner

On Wed, Apr 17, 2024 at 08:22:22AM -0700, Nathan Chancellor wrote:
> Hi Christian,
> 
> I am looking at tools/testing/selftests/clone3/clone3_set_tid.c as part
> of a patch to clean up the uses of 'return ksft_exit_...();' throughout
> the selftests (as they call exit() so they do not return) and I noticed
> that it seems to always pass even when there may have been an error?
> 
>   if (waitpid(ns_pid, , 0) < 0) {
>   ksft_print_msg("Child returned %s\n", strerror(errno));
>   ret = -errno;
>   goto out;
>   }
> 
>   ...
>   out:
>   ret = 0;
> 
>   return !ret ? ksft_exit_pass() : ksft_exit_fail();
>   }
> 
> Should the ret and out label positions be switched? Alternatively, it
> seems like ret and out do not have that many uses, perhaps it would just
> be better to call the ksft_exit_...() directly in their respective
> paths? I am not going to touch it as part of my patch but I felt it was
> worth reporting since it appears to have been there since the
> introduction of this test in commit 41585bbeeef9 ("selftests: add tests
> for clone3() with *set_tid").

Uh, good point. Let me Cc the original author.

Re: [PATCH] erofs: set SB_NODEV sb_flags when mounting with fsid

2024-04-16 Thread Christian Brauner

> > I'm not sure how to resolve it in EROFS itself, anyway...

Instead of allocating the erofs_sb_info in fill_super() allocate it
during erofs_get_tree() and then you can ensure that you always have the
info you need available during erofs_kill_sb(). See the appended
(untested) patch.
>From e4f586a41748b6edc05aca36d49b7b39e55def81 Mon Sep 17 00:00:00 2001
From: Christian Brauner 
Date: Mon, 15 Apr 2024 20:17:46 +0800
Subject: [PATCH] erofs: reliably distinguish block based and fscache mode

When erofs_kill_sb() is called in block dev based mode, s_bdev may not have
been initialised yet, and if CONFIG_EROFS_FS_ONDEMAND is enabled, it will
be mistaken for fscache mode, and then attempt to free an anon_dev that has
never been allocated, triggering the following warning:


ida_free called for id=0 which is not allocated.
WARNING: CPU: 14 PID: 926 at lib/idr.c:525 ida_free+0x134/0x140
Modules linked in:
CPU: 14 PID: 926 Comm: mount Not tainted 6.9.0-rc3-dirty #630
RIP: 0010:ida_free+0x134/0x140
Call Trace:
 
 erofs_kill_sb+0x81/0x90
 deactivate_locked_super+0x35/0x80
 get_tree_bdev+0x136/0x1e0
 vfs_get_tree+0x2c/0xf0
 do_new_mount+0x190/0x2f0
 [...]


Instead of allocating the erofs_sb_info in fill_super() allocate it
during erofs_get_tree() and ensure that erofs can always have the info
available during erofs_kill_sb().

Signed-off-by: Baokun Li 
Signed-off-by: Christian Brauner 
---
 fs/erofs/super.c | 29 -
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index c0eb139adb07..4ed80154edf8 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -581,7 +581,7 @@ static const struct export_operations erofs_export_ops = {
 static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
 {
 	struct inode *inode;
-	struct erofs_sb_info *sbi;
+	struct erofs_sb_info *sbi = EROFS_SB(sb);
 	struct erofs_fs_context *ctx = fc->fs_private;
 	int err;
 
@@ -590,15 +590,10 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
 	sb->s_op = _sops;
 
-	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
-	if (!sbi)
-		return -ENOMEM;
-
 	sb->s_fs_info = sbi;
 	sbi->opt = ctx->opt;
 	sbi->devs = ctx->devs;
 	ctx->devs = NULL;
-	sbi->fsid = ctx->fsid;
 	ctx->fsid = NULL;
 	sbi->domain_id = ctx->domain_id;
 	ctx->domain_id = NULL;
@@ -707,8 +702,15 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
 static int erofs_fc_get_tree(struct fs_context *fc)
 {
 	struct erofs_fs_context *ctx = fc->fs_private;
+	struct erofs_sb_info *sbi;
+
+	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
+	if (!sbi)
+		return -ENOMEM;
 
-	if (IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) && ctx->fsid)
+	fc->s_fs_info = sbi;
+	sbi->fsid = ctx->fsid;
+	if (IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) && sbi->fsid)
 		return get_tree_nodev(fc, erofs_fc_fill_super);
 
 	return get_tree_bdev(fc, erofs_fc_fill_super);
@@ -762,11 +764,15 @@ static void erofs_free_dev_context(struct erofs_dev_context *devs)
 static void erofs_fc_free(struct fs_context *fc)
 {
 	struct erofs_fs_context *ctx = fc->fs_private;
+	struct erofs_sb_info *sbi = fc->s_fs_info;
 
 	erofs_free_dev_context(ctx->devs);
 	kfree(ctx->fsid);
 	kfree(ctx->domain_id);
 	kfree(ctx);
+
+	if (sbi)
+		kfree(sbi);
 }
 
 static const struct fs_context_operations erofs_context_ops = {
@@ -783,6 +789,7 @@ static int erofs_init_fs_context(struct fs_context *fc)
 	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
 	if (!ctx)
 		return -ENOMEM;
+
 	ctx->devs = kzalloc(sizeof(struct erofs_dev_context), GFP_KERNEL);
 	if (!ctx->devs) {
 		kfree(ctx);
@@ -799,17 +806,13 @@ static int erofs_init_fs_context(struct fs_context *fc)
 
 static void erofs_kill_sb(struct super_block *sb)
 {
-	struct erofs_sb_info *sbi;
+	struct erofs_sb_info *sbi = EROFS_SB(sb);
 
-	if (erofs_is_fscache_mode(sb))
+	if (IS_ENABLED(CONFIG_EROFS_FS_ONDEMAND) && sbi->fsid)
 		kill_anon_super(sb);
 	else
 		kill_block_super(sb);
 
-	sbi = EROFS_SB(sb);
-	if (!sbi)
-		return;
-
 	erofs_free_dev_context(sbi->devs);
 	fs_put_dax(sbi->dax_dev, NULL);
 	erofs_fscache_unregister_fs(sb);
-- 
2.43.0

Re: [PATCH v2 04/34] md: port block device access to file

2024-04-15 Thread Christian Brauner

On Mon, Apr 15, 2024 at 10:35:53PM +0800, Ming Lei wrote:
> On Mon, Apr 15, 2024 at 02:35:17PM +0200, Christian Brauner wrote:
> > On Mon, Apr 15, 2024 at 05:26:19PM +0800, Ming Lei wrote:
> > > Hello,
> > > 
> > > On Tue, Jan 23, 2024 at 02:26:21PM +0100, Christian Brauner wrote:
> > > > Signed-off-by: Christian Brauner 
> > > > ---
> > > >  drivers/md/dm.c   | 23 +--
> > > >  drivers/md/md.c   | 12 ++--
> > > >  drivers/md/md.h   |  2 +-
> > > >  include/linux/device-mapper.h |  2 +-
> > > >  4 files changed, 21 insertions(+), 18 deletions(-)
> > > > 
> > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > index 8dcabf84d866..87de5b5682ad 100644
> > > > --- a/drivers/md/dm.c
> > > > +++ b/drivers/md/dm.c
> > > 
> > > ...
> > > 
> > > > @@ -775,7 +778,7 @@ static void close_table_device(struct table_device 
> > > > *td, struct mapped_device *md
> > > >  {
> > > > if (md->disk->slave_dir)
> > > > bd_unlink_disk_holder(td->dm_dev.bdev, md->disk);
> > > > -   bdev_release(td->dm_dev.bdev_handle);
> > > > +   fput(td->dm_dev.bdev_file);
> > > 
> > > The above change caused regression on 'dmsetup remove_all'.
> > > 
> > > blkdev_release() is delayed because of fput(), so dm_lock_for_deletion
> > > returns -EBUSY, then this dm disk is skipped in remove_all().
> > > 
> > > Force to mark DMF_DEFERRED_REMOVE might solve it, but need our device
> > > mapper guys to check if it is safe.
> > > 
> > > Or other better solution?
> > 
> > Yeah, I think there is. You can just switch all fput() instances in
> > device mapper to bdev_fput() which is mainline now. This will yield the
> > device and make it able to be reclaimed. Should be as simple as the
> > patch below. Could you test this and send a patch based on this (I'm on
> > a prolonged vacation so I don't have time right now.):
> 
> Unfortunately it doesn't work.
> 
> Here the problem is that blkdev_release() is delayed, which changes
> 'dmsetup remove_all' behavior, and causes that some of dm disks aren't
> removed.
> 
> Please see dm_lock_for_deletion() and dm_blk_open()/dm_blk_close().

So you really need blkdev_release() itself to be synchronous? Groan, in
that case use __fput_sync() instead of fput() which ensures that this
file is closed synchronously.

Re: [PATCH] erofs: set SB_NODEV sb_flags when mounting with fsid

2024-04-15 Thread Christian Brauner

On Mon, Apr 15, 2024 at 08:17:46PM +0800, Baokun Li wrote:
> When erofs_kill_sb() is called in block dev based mode, s_bdev may not have
> been initialised yet, and if CONFIG_EROFS_FS_ONDEMAND is enabled, it will
> be mistaken for fscache mode, and then attempt to free an anon_dev that has
> never been allocated, triggering the following warning:
> 
> 
> ida_free called for id=0 which is not allocated.
> WARNING: CPU: 14 PID: 926 at lib/idr.c:525 ida_free+0x134/0x140
> Modules linked in:
> CPU: 14 PID: 926 Comm: mount Not tainted 6.9.0-rc3-dirty #630
> RIP: 0010:ida_free+0x134/0x140
> Call Trace:
>  
>  erofs_kill_sb+0x81/0x90
>  deactivate_locked_super+0x35/0x80
>  get_tree_bdev+0x136/0x1e0
>  vfs_get_tree+0x2c/0xf0
>  do_new_mount+0x190/0x2f0
>  [...]
> 
> 
> To avoid this problem, add SB_NODEV to fc->sb_flags after successfully
> parsing the fsid, and then the superblock inherits this flag when it is
> allocated, so that the sb_flags can be used to distinguish whether it is
> in block dev based mode when calling erofs_kill_sb().
> 
> Signed-off-by: Baokun Li 
> ---
>  fs/erofs/super.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/erofs/super.c b/fs/erofs/super.c
> index b21bd8f78dc1..7539ce7d64bc 100644
> --- a/fs/erofs/super.c
> +++ b/fs/erofs/super.c
> @@ -520,6 +520,7 @@ static int erofs_fc_parse_param(struct fs_context *fc,
>   ctx->fsid = kstrdup(param->string, GFP_KERNEL);
>   if (!ctx->fsid)
>   return -ENOMEM;
> + fc->sb_flags |= SB_NODEV;

Hm, I wouldn't do it this way. That's an abuse of that flag imho.
Record the information in the erofs_fs_context if you need to.

Re: [PATCH v2 04/34] md: port block device access to file

2024-04-15 Thread Christian Brauner

On Mon, Apr 15, 2024 at 05:26:19PM +0800, Ming Lei wrote:
> Hello,
> 
> On Tue, Jan 23, 2024 at 02:26:21PM +0100, Christian Brauner wrote:
> > Signed-off-by: Christian Brauner 
> > ---
> >  drivers/md/dm.c   | 23 +--
> >  drivers/md/md.c   | 12 ++--
> >  drivers/md/md.h   |  2 +-
> >  include/linux/device-mapper.h |  2 +-
> >  4 files changed, 21 insertions(+), 18 deletions(-)
> > 
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 8dcabf84d866..87de5b5682ad 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> 
> ...
> 
> > @@ -775,7 +778,7 @@ static void close_table_device(struct table_device *td, 
> > struct mapped_device *md
> >  {
> > if (md->disk->slave_dir)
> > bd_unlink_disk_holder(td->dm_dev.bdev, md->disk);
> > -   bdev_release(td->dm_dev.bdev_handle);
> > +   fput(td->dm_dev.bdev_file);
> 
> The above change caused regression on 'dmsetup remove_all'.
> 
> blkdev_release() is delayed because of fput(), so dm_lock_for_deletion
> returns -EBUSY, then this dm disk is skipped in remove_all().
> 
> Force to mark DMF_DEFERRED_REMOVE might solve it, but need our device
> mapper guys to check if it is safe.
> 
> Or other better solution?

Yeah, I think there is. You can just switch all fput() instances in
device mapper to bdev_fput() which is mainline now. This will yield the
device and make it able to be reclaimed. Should be as simple as the
patch below. Could you test this and send a patch based on this (I'm on
a prolonged vacation so I don't have time right now.):

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 56aa2a8b9d71..0f681a1e70af 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -765,7 +765,7 @@ static struct table_device *open_table_device(struct 
mapped_device *md,
return td;

 out_blkdev_put:
-   fput(bdev_file);
+   bdev_fput(bdev_file);
 out_free_td:
kfree(td);
return ERR_PTR(r);
@@ -778,7 +778,7 @@ static void close_table_device(struct table_device *td, 
struct mapped_device *md
 {
if (md->disk->slave_dir)
bd_unlink_disk_holder(td->dm_dev.bdev, md->disk);
-   fput(td->dm_dev.bdev_file);
+   bdev_fput(td->dm_dev.bdev_file);
put_dax(td->dm_dev.dax_dev);
list_del(>list);
kfree(td);

Re: [PATCH vfs.all 19/26] dm-vdo: convert to use bdev_file

2024-04-11 Thread Christian Brauner

On Wed, Apr 10, 2024 at 06:40:22PM +0100, Al Viro wrote:
> On Wed, Apr 10, 2024 at 01:26:47PM -0400, Matthew Sakai wrote:
> 
> > > 'dm_dev->bdev_file', it's ok to get inode from the file.
> 
> It can be done much easier, though -
> 
> [PATCH] dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode)
> 
> going to be faster, actually - shift is cheaper than dereference...
> 
> Signed-off-by: Al Viro 
> ---

I've used that patch instead of the original one.

Re: [PATCH] fs: Set file_handle::handle_bytes before referencing file_handle::f_handle

2024-04-05 Thread Christian Brauner

On Thu, Apr 04, 2024 at 11:19:00AM +0200, Jan Kara wrote:
> On Wed 03-04-24 14:54:03, Kees Cook wrote:
> > With adding __counted_by(handle_bytes) to struct file_handle, we need
> > to explicitly set it in the one place it wasn't yet happening prior to
> > accessing the flex array "f_handle".
> > 
> > Fixes: 1b43c4629756 ("fs: Annotate struct file_handle with __counted_by() 
> > and use struct_size()")
> > Signed-off-by: Kees Cook 
> 
> OK, so this isn't really a functional bug AFAIU but the compiler will
> wrongly complain we are accessing handle->f_handle beyond claimed array
> size (because handle->handle_bytes == 0 at that point). Am I right? If

And really, this also needs to please be mentioned in the commit message
because from reading the commit message I'm not even sure what this
patch is trying to fix.

Re: [PATCH 15/26] mm: Export writeback_iter()

2024-04-05 Thread Christian Brauner

On Wed, Apr 03, 2024 at 01:58:15PM +0100, David Howells wrote:
> Christoph Hellwig  wrote:
> 
> > > So why are we bothering with EXPORT_SYMBOL at all?  Why don't you just
> > > send a patch replace all of them with EXPORT_SYMBOL_GPL()?
> > 
> > No my business.
> 
> Clearly it is as you're gradually replacing APIs with stuff that is GPL'd.
> 
> > But if you want to side track this let me just put this in here:
> > 
> > NAK to the non-GPL EXPORT of writeback_iter().
> 
> Very well, I'll switch that export to GPL.  Christian, if you can amend that
> patch in your tree?

Sorted yesterday night!

Re: [PATCH 00/26] netfs, afs, 9p, cifs: Rework netfs to use ->writepages() to copy to cache

2024-04-02 Thread Christian Brauner

On Thu, 28 Mar 2024 16:33:52 +, David Howells wrote:
> The primary purpose of these patches is to rework the netfslib writeback
> implementation such that pages read from the cache are written to the cache
> through ->writepages(), thereby allowing the fscache page flag to be
> retired.
> 
> The reworking also:
> 
> [...]

Pulled from netfs-writeback which contains the minor fixes pointed out.

---

Applied to the vfs.netfs branch of the vfs/vfs.git tree.
Patches in the vfs.netfs branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.netfs

Re: [PATCH] orangefs: cleanup uses of strncpy

2024-04-01 Thread Christian Brauner

On Fri, 22 Mar 2024 21:41:18 +, Justin Stitt wrote:
> strncpy() is deprecated for use on NUL-terminated destination strings
> [1] and as such we should prefer more robust and less ambiguous string
> interfaces.
> 
> There is some care taken to ensure these destination buffers are
> NUL-terminated by bounding the strncpy()'s by ORANGEFS_NAME_MAX - 1 or
> ORANGEFS_MAX_SERVER_ADDR_LEN - 1. Instead, we can use the new 2-argument
> version of strscpy() to guarantee NUL-termination on the destination
> buffers while simplifying the code.
> 
> [...]

If this needs to go separately from the vfs trees let me know.

---

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] orangefs: cleanup uses of strncpy
  https://git.kernel.org/vfs/vfs/c/fc10fed37526

Re: kernel crash in mknod

2024-03-28 Thread Christian Brauner

On Thu, Mar 28, 2024 at 01:24:25PM +0200, Roberto Sassu wrote:
> On 3/28/2024 12:08 PM, Christian Brauner wrote:
> > On Thu, Mar 28, 2024 at 12:53:40PM +0200, Roberto Sassu wrote:
> > > On 3/26/2024 12:40 PM, Christian Brauner wrote:
> > > > > we can change the parameter of security_path_post_mknod() from
> > > > > dentry to inode?
> > > > 
> > > > If all current callers only operate on the inode then it seems the best
> > > > to only pass the inode. If there's some reason someone later needs a
> > > > dentry the hook can always be changed.
> > > 
> > > Ok, so the crash is likely caused by:
> > > 
> > > void security_path_post_mknod(struct mnt_idmap *idmap, struct dentry
> > > *dentry)
> > > {
> > >  if (unlikely(IS_PRIVATE(d_backing_inode(dentry
> > > 
> > > I guess we can also simply check if there is an inode attached to the
> > > dentry, to minimize the changes. I can do both.
> > > 
> > > More technical question, do I need to do extra checks on the dentry before
> > > calling security_path_post_mknod()?
> > 
> > Why do you need the dentry? The two users I see are ima in [1] and evm in 
> > [2].
> > Both of them don't care about the dentry. They only care about the
> > inode. So why is that hook not just:
> 
> Sure, I can definitely do that. Seems an easier fix to do an extra check in
> security_path_post_mknod(), rather than changing the parameter everywhere.

You only have two callers and the generic implementation.

> 
> Next time, when we introduce new LSM hooks we can try to introduce more
> specific parameters.
> 
> Also, consider that the pre hook security_path_mknod() has the dentry as
> parameter. For symmetry, we could keep it in the post hook.

I think that's not that important.

> 
> What I was also asking is if I can still call d_backing_inode() on the
> dentry without extra checks, and avoiding the IS_PRIVATE() check if the
> former returns NULL.

Re: kernel crash in mknod

2024-03-28 Thread Christian Brauner

On Thu, Mar 28, 2024 at 12:53:40PM +0200, Roberto Sassu wrote:
> On 3/26/2024 12:40 PM, Christian Brauner wrote:
> > > we can change the parameter of security_path_post_mknod() from
> > > dentry to inode?
> > 
> > If all current callers only operate on the inode then it seems the best
> > to only pass the inode. If there's some reason someone later needs a
> > dentry the hook can always be changed.
> 
> Ok, so the crash is likely caused by:
> 
> void security_path_post_mknod(struct mnt_idmap *idmap, struct dentry
> *dentry)
> {
> if (unlikely(IS_PRIVATE(d_backing_inode(dentry
> 
> I guess we can also simply check if there is an inode attached to the
> dentry, to minimize the changes. I can do both.
> 
> More technical question, do I need to do extra checks on the dentry before
> calling security_path_post_mknod()?

Why do you need the dentry? The two users I see are ima in [1] and evm in [2].
Both of them don't care about the dentry. They only care about the
inode. So why is that hook not just:

diff --git a/security/security.c b/security/security.c
index 7e118858b545..025689a7e912 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1799,11 +1799,11 @@ EXPORT_SYMBOL(security_path_mknod);
  *
  * Update inode security field after a file has been created.
  */
-void security_path_post_mknod(struct mnt_idmap *idmap, struct dentry *dentry)
+void security_inode_post_mknod(struct mnt_idmap *idmap, struct inode *inode)
 {
-   if (unlikely(IS_PRIVATE(d_backing_inode(dentry
+   if (unlikely(IS_PRIVATE(inode)))
return;
-   call_void_hook(path_post_mknod, idmap, dentry);
+   call_void_hook(path_post_mknod, idmap, inode);
 }

 /**

And one another thing I'd like to point out is that the security hook is
called "security_path_post_mknod()" while the evm and ima hooks are
called evm_post_path_mknod() and ima_post_path_mknod() respectively. In
other words:

git grep _path_post_mknod() doesn't show the implementers of that hook
which is rather unfortunate. It would be better if the pattern were:

_$some_$ordered_$words()

[1]:
static void evm_post_path_mknod(struct mnt_idmap *idmap, struct dentry *dentry)
{
struct inode *inode = d_backing_inode(dentry);
struct evm_iint_cache *iint = evm_iint_inode(inode);

if (!S_ISREG(inode->i_mode))
return;

if (iint)
iint->flags |= EVM_NEW_FILE;
}

[2]:
static void ima_post_path_mknod(struct mnt_idmap *idmap, struct dentry *dentry)
{
struct ima_iint_cache *iint;
struct inode *inode = dentry->d_inode;
int must_appraise;

if (!ima_policy_flag || !S_ISREG(inode->i_mode))
return;

must_appraise = ima_must_appraise(idmap, inode, MAY_ACCESS,
  FILE_CHECK);
if (!must_appraise)
return;

/* Nothing to do if we can't allocate memory */
iint = ima_inode_get(inode);
if (!iint)
return;

/* needed for re-opening empty files */
iint->flags |= IMA_NEW_FILE;
}



> 
> Thanks
> 
> Roberto
> 
> > For bigger changes it's also worthwhile if the object that's passed down
> > into the hook-based LSM layer is as specific as possible. If someone
> > does a change that affects lifetime rules of mounts then any hook that
> > takes a struct path argument that's unused means going through each LSM
> > that implements the hook only to find out it's not actually used.
> > Similar for dentry vs inode imho.
>

Re: kernel crash in mknod

2024-03-26 Thread Christian Brauner

> we can change the parameter of security_path_post_mknod() from
> dentry to inode?

If all current callers only operate on the inode then it seems the best
to only pass the inode. If there's some reason someone later needs a
dentry the hook can always be changed.

For bigger changes it's also worthwhile if the object that's passed down
into the hook-based LSM layer is as specific as possible. If someone
does a change that affects lifetime rules of mounts then any hook that
takes a struct path argument that's unused means going through each LSM
that implements the hook only to find out it's not actually used.
Similar for dentry vs inode imho.

Re: [PATCH][next] fs: Annotate struct file_handle with __counted_by() and use struct_size()

2024-03-26 Thread Christian Brauner

On Mon, 25 Mar 2024 19:34:01 -0600, Gustavo A. R. Silva wrote:
> Prepare for the coming implementation by GCC and Clang of the __counted_by
> attribute. Flexible array members annotated with __counted_by can have
> their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
> array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> functions).
> 
> While there, use struct_size() helper, instead of the open-coded
> version.
> 
> [...]

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] fs: Annotate struct file_handle with __counted_by() and use struct_size()
  https://git.kernel.org/vfs/vfs/c/1b43c4629756

Re: kernel crash in mknod

2024-03-25 Thread Christian Brauner

On Sun, Mar 24, 2024 at 04:50:24PM +, Roberto Sassu wrote:
> > From: Al Viro [mailto:v...@ftp.linux.org.uk] On Behalf Of Al Viro
> > Sent: Sunday, March 24, 2024 6:47 AM
> > On Sun, Mar 24, 2024 at 12:00:15AM -0500, Steve French wrote:
> > > Anyone else seeing this kernel crash in do_mknodat (I see it with a
> > > simple "mkfifo" on smb3 mount).  I started seeing this in 6.9-rc (did
> > > not see it in 6.8).   I did not see it with the 3/12/23 mainline
> > > (early in the 6.9-rc merge Window) but I do see it in the 3/22 build
> > > so it looks like the regression was introduced by:
> > 
> > FWIW, successful ->mknod() is allowed to return 0 and unhash
> > dentry, rather than bothering with lookups.  So commit in question
> > is bogus - lack of error does *NOT* mean that you have struct inode
> > existing, let alone attached to dentry.  That kind of behaviour
> > used to be common for network filesystems more than just for ->mknod(),
> > the theory being "if somebody wants to look at it, they can bloody
> > well pay the cost of lookup after dcache miss".
> > 
> > Said that, the language in D/f/vfs.rst is vague as hell and is very easy
> > to misread in direction of "you must instantiate".
> > 
> > Thankfully, there's no counterpart with mkdir - *there* it's not just
> > possible, it's inevitable in some cases for e.g. nfs.
> > 
> > What the hell is that hook doing in non-S_IFREG cases, anyway?  Move it
> > up and be done with it...
> 
> Hi Al
> 
> thanks for the patch. Indeed, it was like that before, when instead of
> an LSM hook there was an IMA call.

Could you please start adding lore links into your commit messages for
all messages that are sent to a mailing list? It really makes tracking
down the original thread a lot easier.

> However, I thought, since we were promoting it as an LSM hook,
> we should be as generic possible, and support more usages than
> what was needed for IMA.

I'm a bit confused now why this is taking a dentry. Nothing in IMA or
EVM cares about the dentry for these hooks so it really should have take
an inode in the first place?

And one minor other question I just realized. Why are some of the new
hooks called security_path_post_mknod() when they aren't actually taking
a path in contrast to say
security_path_{chown,chmod,mknod,chroot,truncate}() that do.

Re: [PATCH v2] statx: stx_subvol

2024-03-12 Thread Christian Brauner

On Mon, Mar 11, 2024 at 11:43:00PM +0100, David Sterba wrote:
> On Mon, Mar 11, 2024 at 02:43:11PM +0100, Christian Brauner wrote:
> > On Mon, Mar 11, 2024 at 08:12:33AM +, Johannes Thumshirn wrote:
> > > On 08.03.24 03:29, Kent Overstreet wrote:
> > > > Add a new statx field for (sub)volume identifiers, as implemented by
> > > > btrfs and bcachefs.
> > > > 
> > > > This includes bcachefs support; we'll definitely want btrfs support as
> > > > well.
> > > 
> > > For btrfs you can add the following:
> > > 
> > > 
> > >  From 82343b7cb2a947bca43234c443b9c22339367f68 Mon Sep 17 00:00:00 2001
> > > From: Johannes Thumshirn 
> > > Date: Mon, 11 Mar 2024 09:09:36 +0100
> > > Subject: [PATCH] btrfs: provide subvolume id for statx
> > > 
> > > Add the inode's subvolume id to the newly proposed statx subvol field.
> > > 
> > > Signed-off-by: Johannes Thumshirn 
> > > ---
> > 
> > Thanks, will fold, once I hear from Josef.
> 
> We're OK with it.

Thanks!

Re: [PATCH v2] statx: stx_subvol

2024-03-12 Thread Christian Brauner

On Mon, Mar 11, 2024 at 04:15:04PM -0400, Kent Overstreet wrote:
> On Mon, Mar 11, 2024 at 02:43:11PM +0100, Christian Brauner wrote:
> > On Mon, Mar 11, 2024 at 08:12:33AM +, Johannes Thumshirn wrote:
> > > On 08.03.24 03:29, Kent Overstreet wrote:
> > > > Add a new statx field for (sub)volume identifiers, as implemented by
> > > > btrfs and bcachefs.
> > > > 
> > > > This includes bcachefs support; we'll definitely want btrfs support as
> > > > well.
> > > 
> > > For btrfs you can add the following:
> > > 
> > > 
> > >  From 82343b7cb2a947bca43234c443b9c22339367f68 Mon Sep 17 00:00:00 2001
> > > From: Johannes Thumshirn 
> > > Date: Mon, 11 Mar 2024 09:09:36 +0100
> > > Subject: [PATCH] btrfs: provide subvolume id for statx
> > > 
> > > Add the inode's subvolume id to the newly proposed statx subvol field.
> > > 
> > > Signed-off-by: Johannes Thumshirn 
> > > ---
> > 
> > Thanks, will fold, once I hear from Josef.
> 
> Can we try to make 6.9? Need to know what to put in the manpage, and
> I've got userspace tooling that wants to use it.

Hm, I understand that you want to see this done but I think it's just
too tight. If this would've been Acked last week then we could've done
it the second week of the merge window. So my reaction is to wait for
v6.10. What do others think?

Re: [PATCH v2] statx: stx_subvol

2024-03-11 Thread Christian Brauner

On Mon, Mar 11, 2024 at 08:12:33AM +, Johannes Thumshirn wrote:
> On 08.03.24 03:29, Kent Overstreet wrote:
> > Add a new statx field for (sub)volume identifiers, as implemented by
> > btrfs and bcachefs.
> > 
> > This includes bcachefs support; we'll definitely want btrfs support as
> > well.
> 
> For btrfs you can add the following:
> 
> 
>  From 82343b7cb2a947bca43234c443b9c22339367f68 Mon Sep 17 00:00:00 2001
> From: Johannes Thumshirn 
> Date: Mon, 11 Mar 2024 09:09:36 +0100
> Subject: [PATCH] btrfs: provide subvolume id for statx
> 
> Add the inode's subvolume id to the newly proposed statx subvol field.
> 
> Signed-off-by: Johannes Thumshirn 
> ---

Thanks, will fold, once I hear from Josef.

Re: [PATCH v2] statx: stx_subvol

2024-03-11 Thread Christian Brauner

On Fri, Mar 08, 2024 at 08:56:33AM -0800, Darrick J. Wong wrote:
> On Fri, Mar 08, 2024 at 11:48:31AM -0500, Kent Overstreet wrote:
> > On Fri, Mar 08, 2024 at 11:44:48AM -0500, Neal Gompa wrote:
> > > On Fri, Mar 8, 2024 at 11:34 AM Kent Overstreet
> > >  wrote:
> > > >
> > > > On Fri, Mar 08, 2024 at 06:42:27AM -0500, Neal Gompa wrote:
> > > > > On Thu, Mar 7, 2024 at 9:29 PM Kent Overstreet
> > > > >  wrote:
> > > > > >
> > > > > > Add a new statx field for (sub)volume identifiers, as implemented by
> > > > > > btrfs and bcachefs.
> > > > > >
> > > > > > This includes bcachefs support; we'll definitely want btrfs support 
> > > > > > as
> > > > > > well.
> > > > > >
> > > > > > Link: 
> > > > > > https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/
> > > > > > Signed-off-by: Kent Overstreet 
> > > > > > Cc: Josef Bacik 
> > > > > > Cc: Miklos Szeredi 
> > > > > > Cc: Christian Brauner 
> > > > > > Cc: David Howells 
> > > > > > Signed-off-by: Kent Overstreet 
> > > > > > ---
> > > > > >  fs/bcachefs/fs.c  | 3 +++
> > > > > >  fs/stat.c | 1 +
> > > > > >  include/linux/stat.h  | 1 +
> > > > > >  include/uapi/linux/stat.h | 4 +++-
> > > > > >  4 files changed, 8 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
> > > > > > index 3f073845bbd7..6a542ed43e2c 100644
> > > > > > --- a/fs/bcachefs/fs.c
> > > > > > +++ b/fs/bcachefs/fs.c
> > > > > > @@ -840,6 +840,9 @@ static int bch2_getattr(struct mnt_idmap *idmap,
> > > > > > stat->blksize   = block_bytes(c);
> > > > > > stat->blocks= inode->v.i_blocks;
> > > > > >
> > > > > > +   stat->subvol= inode->ei_subvol;
> > > > > > +   stat->result_mask |= STATX_SUBVOL;
> > > > > > +
> > > > > > if (request_mask & STATX_BTIME) {
> > > > > > stat->result_mask |= STATX_BTIME;
> > > > > > stat->btime = bch2_time_to_timespec(c, 
> > > > > > inode->ei_inode.bi_otime);
> > > > > > diff --git a/fs/stat.c b/fs/stat.c
> > > > > > index 77cdc69eb422..70bd3e888cfa 100644
> > > > > > --- a/fs/stat.c
> > > > > > +++ b/fs/stat.c
> > > > > > @@ -658,6 +658,7 @@ cp_statx(const struct kstat *stat, struct statx 
> > > > > > __user *buffer)
> > > > > > tmp.stx_mnt_id = stat->mnt_id;
> > > > > > tmp.stx_dio_mem_align = stat->dio_mem_align;
> > > > > > tmp.stx_dio_offset_align = stat->dio_offset_align;
> > > > > > +   tmp.stx_subvol = stat->subvol;
> > > > > >
> > > > > > return copy_to_user(buffer, , sizeof(tmp)) ? -EFAULT : 
> > > > > > 0;
> > > > > >  }
> > > > > > diff --git a/include/linux/stat.h b/include/linux/stat.h
> > > > > > index 52150570d37a..bf92441dbad2 100644
> > > > > > --- a/include/linux/stat.h
> > > > > > +++ b/include/linux/stat.h
> > > > > > @@ -53,6 +53,7 @@ struct kstat {
> > > > > > u32 dio_mem_align;
> > > > > > u32 dio_offset_align;
> > > > > > u64 change_cookie;
> > > > > > +   u64 subvol;
> > > > > >  };
> > > > > >
> > > > > >  /* These definitions are internal to the kernel for now. Mainly 
> > > > > > used by nfsd. */
> > > > > > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> > > > > > index 2f2ee82d5517..67626d535316 100644
> > > > > > --- a/include/uapi/linux/stat.h
> > > > > > +++ b/include/uapi/linux/stat.h
> > > > > > @@ -126,8 +126,9 @@ struct statx {
> > > > > > __u64   stx_mnt_id;
> > > > > > __u32   stx_

Re: [PATCH] statx: stx_vol

2024-03-08 Thread Christian Brauner

On Thu, Mar 07, 2024 at 12:39:58PM -0500, Kent Overstreet wrote:
> On Mon, Mar 04, 2024 at 10:18:22AM +0100, Christian Brauner wrote:
> > On Sat, Mar 02, 2024 at 05:02:03PM -0500, Kent Overstreet wrote:
> > > Add a new statx field for (sub)volume identifiers.
> > > 
> > > This includes bcachefs support; we'll definitely want btrfs support as
> > > well.
> > > 
> > > Link: 
> > > https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/
> > > Signed-off-by: Kent Overstreet 
> > > Cc: Josef Bacik 
> > > Cc: Miklos Szeredi 
> > > Cc: Christian Brauner 
> > > Cc: David Howells 
> > > ---
> > 
> > As I've said many times before I'm supportive of this and would pick up
> > a patch like this. There's definitely a lot of userspace that would make
> > use of this that I'm aware of. If the btrfs people could provide an Ack
> > on this to express their support here that would be great.
> > 
> > And it would be lovely if we could expand the commit message a bit and
> > do some renaming/bikeshedding. Imho, STATX_SUBVOLUME_ID is great and
> > then stx_subvolume_id or stx_subvol_id. And then subvolume_id or
> > subvol_id for the field in struct kstat.
> 
> _id is too redundant for me, can we just do STATX_SUBVOL/statx.subvol?

Fine by me.

Re: [PATCH] erofs: fix lockdep false positives on initializing erofs_pseudo_mnt

2024-03-07 Thread Christian Brauner

On Thu, Mar 07, 2024 at 12:18:52PM +0800, Gao Xiang wrote:
> Hi,
> 
> (try to +Cc Christian and Al here...)
> 
> On 2024/3/7 11:41, Jingbo Xu wrote:
> > Hi Baokun,
> > 
> > Thanks for catching this!
> > 
> > 
> > On 3/7/24 10:52 AM, Gao Xiang wrote:
> > > Hi Baokun,
> > > 
> > > On 2024/3/7 10:44, Baokun Li wrote:
> > > > Lockdep reported the following issue when mounting erofs with a
> > > > domain_id:
> > > > 
> > > > 
> > > > WARNING: possible recursive locking detected
> > > > 6.8.0-rc7-xfstests #521 Not tainted
> > > > 
> > > > mount/396 is trying to acquire lock:
> > > > 907a80e0 (>s_umount_key#50/1){+.+.}-{3:3},
> > > >      at: alloc_super+0xe3/0x3d0
> > > > 
> > > > but task is already holding lock:
> > > > 907a8aaa90e0 (>s_umount_key#50/1){+.+.}-{3:3},
> > > >      at: alloc_super+0xe3/0x3d0
> > > > 
> > > > other info that might help us debug this:
> > > >    Possible unsafe locking scenario:
> > > > 
> > > >      CPU0
> > > >      
> > > >     lock(>s_umount_key#50/1);
> > > >     lock(>s_umount_key#50/1);
> > > > 
> > > >    *** DEADLOCK ***
> > > > 
> > > >    May be due to missing lock nesting notation
> > > > 
> > > > 2 locks held by mount/396:
> > > >    #0: 907a8aaa90e0 (>s_umount_key#50/1){+.+.}-{3:3},
> > > >      at: alloc_super+0xe3/0x3d0
> > > >    #1: c00e6f28 (erofs_domain_list_lock){+.+.}-{3:3},
> > > >      at: erofs_fscache_register_fs+0x3d/0x270 [erofs]
> > > > 
> > > > stack backtrace:
> > > > CPU: 1 PID: 396 Comm: mount Not tainted 6.8.0-rc7-xfstests #521
> > > > Call Trace:
> > > >    
> > > >    dump_stack_lvl+0x64/0xb0
> > > >    validate_chain+0x5c4/0xa00
> > > >    __lock_acquire+0x6a9/0xd50
> > > >    lock_acquire+0xcd/0x2b0
> > > >    down_write_nested+0x45/0xd0
> > > >    alloc_super+0xe3/0x3d0
> > > >    sget_fc+0x62/0x2f0
> > > >    vfs_get_super+0x21/0x90
> > > >    vfs_get_tree+0x2c/0xf0
> > > >    fc_mount+0x12/0x40
> > > >    vfs_kern_mount.part.0+0x75/0x90
> > > >    kern_mount+0x24/0x40
> > > >    erofs_fscache_register_fs+0x1ef/0x270 [erofs]
> > > >    erofs_fc_fill_super+0x213/0x380 [erofs]
> > > > 
> > > > This is because the file_system_type of both erofs and the pseudo-mount
> > > > point of domain_id is erofs_fs_type, so two successive calls to
> > > > alloc_super() are considered to be using the same lock and trigger the
> > > > warning above.
> > > > 
> > > > Therefore add a nodev file_system_type named erofs_anon_fs_type to
> > > > silence this complaint. In addition, to reduce code coupling, refactor
> > > > out the erofs_anon_init_fs_context() and erofs_kill_pseudo_sb() 
> > > > functions
> > > > and move the erofs_pseudo_mnt related code to fscache.c.
> > > > 
> > > > Signed-off-by: Baokun Li 
> > > 
> > > IMHO, in the beginning, I'd like to avoid introducing another fs type
> > > for erofs to share (meta)data between filesystems since it will cause
> > > churn, could we use some alternative way to resolve this?
> > 
> > Yeah as Gao Xiang said, this is initially intended to avoid introducing
> > anothoer file_system_type, say erofs_anon_fs_type.
> > 
> > What we need is actually a method of allocating anonymous inode as a
> > sentinel identifying each blob.  There is indeed a global mount, i.e.
> > anon_inode_mnt, for allocating anonymous inode/file specifically.  At
> > the time the share domain feature is introduced, there's only one
> > anonymous inode, i.e. anon_inode_inode, and all the allocated anonymous
> > files are bound to this single anon_inode_inode.  Thus we decided to
> > implement a erofs internal pseudo mount for this usage.
> > 
> > But I noticed that we can now allocate unique anonymous inodes from
> > anon_inode_mnt since commit e7e832c ("fs: add LSM-supporting anon-inode
> > interface"), though the new interface is initially for LSM usage.
> 
> Yes, as summary, EROFS now maintains a bunch of anon inodes among
> all different filesystem instances, so that like
> 
> blob sharing or
> page cache sharing across filesystems can be done.
> 
> In brief, I think the following patch is a good idea but it
> hasn't been landed until now:
> https://lore.kernel.org/r/20210309155348.974875-3-...@lst.de
> 
> Other than that, is it a good idea to introduce another fs type
> (like erofs_anon_fs_type) for such usage?

It depends. If you're allocating a lot of inodes then having a separate
filesystem type for erofs makes sense. If it's just a few then it
probably doesn't matter. If you need custom inode operations for these
anonymous inodes then it also makes sense to have a separate filesystem
type.

Re: [PATCH v2 24/25] commoncap: use vfs fscaps interfaces

2024-03-05 Thread Christian Brauner

On Tue, Mar 05, 2024 at 01:46:56PM +0100, Roberto Sassu wrote:
> On Tue, 2024-03-05 at 10:12 +0100, Christian Brauner wrote:
> > On Mon, Mar 04, 2024 at 10:56:17AM -0600, Seth Forshee (DigitalOcean) wrote:
> > > On Mon, Mar 04, 2024 at 05:17:57PM +0100, Roberto Sassu wrote:
> > > > On Mon, 2024-03-04 at 09:31 -0600, Seth Forshee (DigitalOcean) wrote:
> > > > > On Mon, Mar 04, 2024 at 11:19:54AM +0100, Roberto Sassu wrote:
> > > > > > On Wed, 2024-02-21 at 15:24 -0600, Seth Forshee (DigitalOcean) 
> > > > > > wrote:
> > > > > > > Use the vfs interfaces for fetching file capabilities for killpriv
> > > > > > > checks and from get_vfs_caps_from_disk(). While there, update the
> > > > > > > kerneldoc for get_vfs_caps_from_disk() to explain how it is 
> > > > > > > different
> > > > > > > from vfs_get_fscaps_nosec().
> > > > > > > 
> > > > > > > Signed-off-by: Seth Forshee (DigitalOcean) 
> > > > > > > ---
> > > > > > >  security/commoncap.c | 30 +-
> > > > > > >  1 file changed, 13 insertions(+), 17 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/security/commoncap.c b/security/commoncap.c
> > > > > > > index a0ff7e6092e0..751bb26a06a6 100644
> > > > > > > --- a/security/commoncap.c
> > > > > > > +++ b/security/commoncap.c
> > > > > > > @@ -296,11 +296,12 @@ int cap_capset(struct cred *new,
> > > > > > >   */
> > > > > > >  int cap_inode_need_killpriv(struct dentry *dentry)
> > > > > > >  {
> > > > > > > - struct inode *inode = d_backing_inode(dentry);
> > > > > > > + struct vfs_caps caps;
> > > > > > >   int error;
> > > > > > >  
> > > > > > > - error = __vfs_getxattr(dentry, inode, XATTR_NAME_CAPS, NULL, 0);
> > > > > > > - return error > 0;
> > > > > > > + /* Use nop_mnt_idmap for no mapping here as mapping is 
> > > > > > > unimportant */
> > > > > > > + error = vfs_get_fscaps_nosec(_mnt_idmap, dentry, );
> > > > > > > + return error == 0;
> > > > > > >  }
> > > > > > >  
> > > > > > >  /**
> > > > > > > @@ -323,7 +324,7 @@ int cap_inode_killpriv(struct mnt_idmap 
> > > > > > > *idmap, struct dentry *dentry)
> > > > > > >  {
> > > > > > >   int error;
> > > > > > >  
> > > > > > > - error = __vfs_removexattr(idmap, dentry, XATTR_NAME_CAPS);
> > > > > > > + error = vfs_remove_fscaps_nosec(idmap, dentry);
> > > > > > 
> > > > > > Uhm, I see that the change is logically correct... but the original
> > > > > > code was not correct, since the EVM post hook is not called (thus 
> > > > > > the
> > > > > > HMAC is broken, or an xattr change is allowed on a portable 
> > > > > > signature
> > > > > > which should be not).
> > > > > > 
> > > > > > For completeness, the xattr change on a portable signature should 
> > > > > > not
> > > > > > happen in the first place, so cap_inode_killpriv() would not be 
> > > > > > called.
> > > > > > However, since EVM allows same value change, we are here.
> > > > > 
> > > > > I really don't understand EVM that well and am pretty hesitant to try 
> > > > > an
> > > > > change any of the logic around it. But I'll hazard a thought: should 
> > > > > EVM
> > > > > have a inode_need_killpriv hook which returns an error in this
> > > > > situation?
> > > > 
> > > > Uhm, I think it would not work without modifying
> > > > security_inode_need_killpriv() and the hook definition.
> > > > 
> > > > Since cap_inode_need_killpriv() returns 1, the loop stops and EVM would
> > > > not be invoked. We would need to continue the loop and let EVM know
> > > > what is the current return value. Then EVM can reject the change.
> > > > 
> > > > An alternative way would be to detect that actually we are setting the
> > > > same value for inode metadata, and maybe not returning 1 from
> > > > cap_inode_need_killpriv().
> > > > 
> > > > I would prefer the second, since EVM allows same value change and we
> > > > would have an exception if there are fscaps.
> > > > 
> > > > This solves only the case of portable signatures. We would need to
> > > > change cap_inode_need_killpriv() anyway to update the HMAC for mutable
> > > > files.
> > > 
> > > I see. In any case this sounds like a matter for a separate patch
> > > series.
> > 
> > Agreed.
> 
> Christian, how realistic is that we don't kill priv if we are setting
> the same owner?

Uhm, I would need to see the wider context of the proposed change. But
iiuc then you would be comparing current and new fscaps and if they are
identical you don't kill privs? I think that would work. But again, I
would need to see the actual context/change to say something meaningful.

Re: [PATCH] xattr: restrict vfs_getxattr_alloc() allocation size

2024-03-05 Thread Christian Brauner

On Tue, 05 Mar 2024 13:27:06 +0100, Christian Brauner wrote:
> The vfs_getxattr_alloc() interface is a special-purpose in-kernel api
> that does a racy query-size+allocate-buffer+retrieve-data. It is used by
> EVM, IMA, and fscaps to retrieve xattrs. Recently, we've seen issues
> where 9p returned values that amount to allocating about 8000GB worth of
> memory (cf. [1]). That's now fixed in 9p. But vfs_getxattr_alloc() has
> no reason to allow getting xattr values that are larger than
> XATTR_MAX_SIZE as that's the limit we use for setting and getting xattr
> values and nothing currently goes beyond that limit afaict. Let it check
> for that and reject requests that are larger than that.
> 
> [...]

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] xattr: restrict vfs_getxattr_alloc() allocation size
  https://git.kernel.org/vfs/vfs/c/82a4c8736d72

[PATCH] xattr: restrict vfs_getxattr_alloc() allocation size

2024-03-05 Thread Christian Brauner

The vfs_getxattr_alloc() interface is a special-purpose in-kernel api
that does a racy query-size+allocate-buffer+retrieve-data. It is used by
EVM, IMA, and fscaps to retrieve xattrs. Recently, we've seen issues
where 9p returned values that amount to allocating about 8000GB worth of
memory (cf. [1]). That's now fixed in 9p. But vfs_getxattr_alloc() has
no reason to allow getting xattr values that are larger than
XATTR_MAX_SIZE as that's the limit we use for setting and getting xattr
values and nothing currently goes beyond that limit afaict. Let it check
for that and reject requests that are larger than that.

Link: https://lore.kernel.org/r/ZeXcQmHWcYvfCR93@do-x1extreme [1]
Signed-off-by: Christian Brauner 
---
 fs/xattr.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/xattr.c b/fs/xattr.c
index 09d927603433..a53c930e3018 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -395,6 +395,9 @@ vfs_getxattr_alloc(struct mnt_idmap *idmap, struct dentry 
*dentry,
if (error < 0)
return error;
 
+   if (error > XATTR_SIZE_MAX)
+   return -E2BIG;
+
if (!value || (error > xattr_size)) {
value = krealloc(*xattr_value, error + 1, flags);
if (!value)
-- 
2.43.0

Re: [PATCH v2 24/25] commoncap: use vfs fscaps interfaces

2024-03-05 Thread Christian Brauner

On Mon, Mar 04, 2024 at 10:56:17AM -0600, Seth Forshee (DigitalOcean) wrote:
> On Mon, Mar 04, 2024 at 05:17:57PM +0100, Roberto Sassu wrote:
> > On Mon, 2024-03-04 at 09:31 -0600, Seth Forshee (DigitalOcean) wrote:
> > > On Mon, Mar 04, 2024 at 11:19:54AM +0100, Roberto Sassu wrote:
> > > > On Wed, 2024-02-21 at 15:24 -0600, Seth Forshee (DigitalOcean) wrote:
> > > > > Use the vfs interfaces for fetching file capabilities for killpriv
> > > > > checks and from get_vfs_caps_from_disk(). While there, update the
> > > > > kerneldoc for get_vfs_caps_from_disk() to explain how it is different
> > > > > from vfs_get_fscaps_nosec().
> > > > > 
> > > > > Signed-off-by: Seth Forshee (DigitalOcean) 
> > > > > ---
> > > > >  security/commoncap.c | 30 +-
> > > > >  1 file changed, 13 insertions(+), 17 deletions(-)
> > > > > 
> > > > > diff --git a/security/commoncap.c b/security/commoncap.c
> > > > > index a0ff7e6092e0..751bb26a06a6 100644
> > > > > --- a/security/commoncap.c
> > > > > +++ b/security/commoncap.c
> > > > > @@ -296,11 +296,12 @@ int cap_capset(struct cred *new,
> > > > >   */
> > > > >  int cap_inode_need_killpriv(struct dentry *dentry)
> > > > >  {
> > > > > - struct inode *inode = d_backing_inode(dentry);
> > > > > + struct vfs_caps caps;
> > > > >   int error;
> > > > >  
> > > > > - error = __vfs_getxattr(dentry, inode, XATTR_NAME_CAPS, NULL, 0);
> > > > > - return error > 0;
> > > > > + /* Use nop_mnt_idmap for no mapping here as mapping is 
> > > > > unimportant */
> > > > > + error = vfs_get_fscaps_nosec(_mnt_idmap, dentry, );
> > > > > + return error == 0;
> > > > >  }
> > > > >  
> > > > >  /**
> > > > > @@ -323,7 +324,7 @@ int cap_inode_killpriv(struct mnt_idmap *idmap, 
> > > > > struct dentry *dentry)
> > > > >  {
> > > > >   int error;
> > > > >  
> > > > > - error = __vfs_removexattr(idmap, dentry, XATTR_NAME_CAPS);
> > > > > + error = vfs_remove_fscaps_nosec(idmap, dentry);
> > > > 
> > > > Uhm, I see that the change is logically correct... but the original
> > > > code was not correct, since the EVM post hook is not called (thus the
> > > > HMAC is broken, or an xattr change is allowed on a portable signature
> > > > which should be not).
> > > > 
> > > > For completeness, the xattr change on a portable signature should not
> > > > happen in the first place, so cap_inode_killpriv() would not be called.
> > > > However, since EVM allows same value change, we are here.
> > > 
> > > I really don't understand EVM that well and am pretty hesitant to try an
> > > change any of the logic around it. But I'll hazard a thought: should EVM
> > > have a inode_need_killpriv hook which returns an error in this
> > > situation?
> > 
> > Uhm, I think it would not work without modifying
> > security_inode_need_killpriv() and the hook definition.
> > 
> > Since cap_inode_need_killpriv() returns 1, the loop stops and EVM would
> > not be invoked. We would need to continue the loop and let EVM know
> > what is the current return value. Then EVM can reject the change.
> > 
> > An alternative way would be to detect that actually we are setting the
> > same value for inode metadata, and maybe not returning 1 from
> > cap_inode_need_killpriv().
> > 
> > I would prefer the second, since EVM allows same value change and we
> > would have an exception if there are fscaps.
> > 
> > This solves only the case of portable signatures. We would need to
> > change cap_inode_need_killpriv() anyway to update the HMAC for mutable
> > files.
> 
> I see. In any case this sounds like a matter for a separate patch
> series.

Agreed.

Re: [PATCH] statx: stx_vol

2024-03-04 Thread Christian Brauner

On Sat, Mar 02, 2024 at 05:02:03PM -0500, Kent Overstreet wrote:
> Add a new statx field for (sub)volume identifiers.
> 
> This includes bcachefs support; we'll definitely want btrfs support as
> well.
> 
> Link: 
> https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/
> Signed-off-by: Kent Overstreet 
> Cc: Josef Bacik 
> Cc: Miklos Szeredi 
> Cc: Christian Brauner 
> Cc: David Howells 
> ---

As I've said many times before I'm supportive of this and would pick up
a patch like this. There's definitely a lot of userspace that would make
use of this that I'm aware of. If the btrfs people could provide an Ack
on this to express their support here that would be great.

And it would be lovely if we could expand the commit message a bit and
do some renaming/bikeshedding. Imho, STATX_SUBVOLUME_ID is great and
then stx_subvolume_id or stx_subvol_id. And then subvolume_id or
subvol_id for the field in struct kstat.

>  fs/bcachefs/fs.c  | 3 +++
>  fs/stat.c | 1 +
>  include/linux/stat.h  | 1 +
>  include/uapi/linux/stat.h | 4 +++-
>  4 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
> index 3f073845bbd7..d82f7f3f0670 100644
> --- a/fs/bcachefs/fs.c
> +++ b/fs/bcachefs/fs.c
> @@ -840,6 +840,9 @@ static int bch2_getattr(struct mnt_idmap *idmap,
>   stat->blksize   = block_bytes(c);
>   stat->blocks= inode->v.i_blocks;
>  
> + stat->vol   = inode->ei_subvol;
> + stat->result_mask |= STATX_VOL;
> +
>   if (request_mask & STATX_BTIME) {
>   stat->result_mask |= STATX_BTIME;
>   stat->btime = bch2_time_to_timespec(c, 
> inode->ei_inode.bi_otime);
> diff --git a/fs/stat.c b/fs/stat.c
> index 77cdc69eb422..80d5f7502d99 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -658,6 +658,7 @@ cp_statx(const struct kstat *stat, struct statx __user 
> *buffer)
>   tmp.stx_mnt_id = stat->mnt_id;
>   tmp.stx_dio_mem_align = stat->dio_mem_align;
>   tmp.stx_dio_offset_align = stat->dio_offset_align;
> + tmp.stx_vol = stat->vol;
>  
>   return copy_to_user(buffer, , sizeof(tmp)) ? -EFAULT : 0;
>  }
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index 52150570d37a..9dc1b493ef1f 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -53,6 +53,7 @@ struct kstat {
>   u32 dio_mem_align;
>   u32 dio_offset_align;
>   u64 change_cookie;
> + u64 vol;
>  };
>  
>  /* These definitions are internal to the kernel for now. Mainly used by 
> nfsd. */
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 2f2ee82d5517..ae090d67946d 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -126,8 +126,9 @@ struct statx {
>   __u64   stx_mnt_id;
>   __u32   stx_dio_mem_align;  /* Memory buffer alignment for direct 
> I/O */
>   __u32   stx_dio_offset_align;   /* File offset alignment for direct I/O 
> */
> + __u64   stx_vol;/* Subvolume identifier */
>   /* 0xa0 */
> - __u64   __spare3[12];   /* Spare space for future expansion */
> + __u64   __spare3[11];   /* Spare space for future expansion */
>   /* 0x100 */
>  };
>  
> @@ -155,6 +156,7 @@ struct statx {
>  #define STATX_MNT_ID 0x1000U /* Got stx_mnt_id */
>  #define STATX_DIOALIGN   0x2000U /* Want/got direct I/O 
> alignment info */
>  #define STATX_MNT_ID_UNIQUE  0x4000U /* Want/got extended 
> stx_mount_id */
> +#define STATX_VOL0x8000U /* Want/got stx_vol */
>  
>  #define STATX__RESERVED  0x8000U /* Reserved for future 
> struct statx expansion */
>  
> -- 
> 2.43.0
>

Re: [PATCH v2 14/25] evm: add support for fscaps security hooks

2024-03-01 Thread Christian Brauner

> I have seen this policy of adding tests in other subsystems (eBPF),

It makes sense if the drive of the patchset would be IMA/EVM features
not refactoring of existing code.

> Happy to try adding the tests, would appreciate your help to review if

Cool, happy to help review them.

Re: [PATCH v2 14/25] evm: add support for fscaps security hooks

2024-03-01 Thread Christian Brauner

On Fri, Mar 01, 2024 at 10:19:13AM +0100, Roberto Sassu wrote:
> On Wed, 2024-02-21 at 15:24 -0600, Seth Forshee (DigitalOcean) wrote:
> > Support the new fscaps security hooks by converting the vfs_caps to raw
> > xattr data and then handling them the same as other xattrs.
> 
> Hi Seth
> 
> I started looking at this patch set.
> 
> The first question I have is if you are also going to update libcap
> (and also tar, I guess), since both deal with the raw xattr.
> 
> From IMA/EVM perspective (Mimi will add on that), I guess it is
> important that files with a signature/HMAC continue to be accessible
> after applying this patch set.
> 
> Looking at the code, it seems the case (if I understood correctly,
> vfs_getxattr_alloc() is still allowed).
> 
> To be sure that everything works, it would be really nice if you could
> also extend our test suite:
> 
> https://github.com/mimizohar/ima-evm-utils/blob/next-testing/tests/portable_signatures.test
> 
> and
> 
> https://github.com/mimizohar/ima-evm-utils/blob/next-testing/tests/evm_hmac.test
> 
> 
> The first test we would need to extend is check_cp_preserve_xattrs,
> which basically does a cp -a. We would need to set fscaps in the
> origin, copy to the destination, and see if the latter is accessible.
> 
> I would also extend:
> 
> check_tar_extract_xattrs_different_owner
> check_tar_extract_xattrs_same_owner
> check_metadata_change
> check_evm_revalidate
> check_evm_portable_sig_ima_appraisal
> check_evm_portable_sig_ima_measurement_list
> 
> It should not be too complicated. The purpose would be to exercise your
> code below.
> 
> 
> Regarding the second test, we would need to extend just check_evm_hmac.
> 
> 
> Just realized, before extending the tests, it would be necessary to
> modify also evmctl.c, to retrieve fscaps through the new interfaces,
> and to let users provide custom fscaps the HMAC or portable signature
> is calculated on.

While request for tests are obviously fine they should be added by the
respective experts for IMA/EVM in this case. I don't think it's
appropriate to expect Seth to do that especially because you seem to
imply that you currently don't have any tests for fscaps at all. We're
always happy to test things and if that'd be adding new IMA/EVM specific
features than it would be something to discuss but really we're
refactoring so the fact that you don't have tests we can run is not the
fault of this patchset and IMA/EVM is just a small portion of it.

Re: [PATCH 2/2] bcachefs: Buffered write path now can avoid the inode lock

2024-02-29 Thread Christian Brauner

On Wed, Feb 28, 2024 at 11:27:05PM -0800, Linus Torvalds wrote:
> On Wed, 28 Feb 2024 at 23:20, Linus Torvalds
>  wrote:
> >
> >
> >  - take the lock exclusively if O_APPEND or if it *looks* like you
> > might extend the file size.
> >
> >  - otherwise, take the shared lock, and THEN RE-CHECK. The file size
> > might have changed, so now you need to double-check that you're really
> > not going to extend the size of the file, and if you are, you need to
> > go back and take the inode lock exclusively after all.
> 
> Same goes for the suid/sgid checks. You need to take the inode lock
> shared to even have the i_mode be stable, and at that point you might
> decide "oh, I need to clear suid/sgid after all" and have to go drop
> the lock and retake it for exclusive access after all.

I agree. And this is how we've done it in xfs as well. The inode lock is
taken shared, then check for IS_NOSEC(inode) and if that's true keep the
shared lock, otherwise upgrade to an exclusive lock. Note that this
whole check also checks filesystem capability xattrs.

I think you'd also get potential security issues or at least very subtle
behavior. Someone could make a binary setuid and a concurrent write
happens. chmod is taking the exclusive lock and there's a concurrent
write going on changing the contents to something unsafe without being
forced to remove the set*id bits.

Similar for {g,u}id changes. Someone starts a chown() taking inode lock
while someone issues a concurrent write. The chown changes the group
ownership so that a writer would be forced to drop the sgid bit. But the
writer proceeds keeping that bit.

So that's all potentially pretty subtle side-effects we'd allow. And
I've removed so many bugs in that area the last couple of years as well.
So let's please use the shared lock.

Re: [PATCH v2 20/25] ovl: add fscaps handlers

2024-02-23 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:51PM -0600, Seth Forshee (DigitalOcean) wrote:
> Add handlers which read fs caps from the lower or upper filesystem and
> write/remove fs caps to the upper filesystem, performing copy-up as
> necessary.
> 
> While fscaps only really make sense on regular files, the general policy
> is to allow most xattr namespaces on all different inode types, so
> fscaps handlers are installed in the inode operations for all types of
> inodes.
> 
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---
>  fs/overlayfs/dir.c   |  2 ++
>  fs/overlayfs/inode.c | 72 
> 
>  fs/overlayfs/overlayfs.h |  5 
>  3 files changed, 79 insertions(+)
> 
> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> index 0f8b4a719237..4ff360fe10c9 100644
> --- a/fs/overlayfs/dir.c
> +++ b/fs/overlayfs/dir.c
> @@ -1307,6 +1307,8 @@ const struct inode_operations ovl_dir_inode_operations 
> = {
>   .get_inode_acl  = ovl_get_inode_acl,
>   .get_acl= ovl_get_acl,
>   .set_acl= ovl_set_acl,
> + .get_fscaps = ovl_get_fscaps,
> + .set_fscaps = ovl_set_fscaps,
>   .update_time= ovl_update_time,
>   .fileattr_get   = ovl_fileattr_get,
>   .fileattr_set   = ovl_fileattr_set,
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index c63b31a460be..7a8978ea6fe1 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -568,6 +568,72 @@ int ovl_set_acl(struct mnt_idmap *idmap, struct dentry 
> *dentry,
>  }
>  #endif
>  
> +int ovl_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> +struct vfs_caps *caps)
> +{
> + int err;
> + const struct cred *old_cred;
> + struct path realpath;
> +
> + ovl_path_real(dentry, );
> + old_cred = ovl_override_creds(dentry->d_sb);
> + err = vfs_get_fscaps(mnt_idmap(realpath.mnt), realpath.dentry, caps);

Right, vfs_get_fscaps() returns a struct vfs_caps which contains a
vfs{g,u}id and has the lower/upper layer's idmap taken into account.

That confused me at first because vfs_get_acl() returns a struct
posix_acl which contains k{g,u}id.

Reading through this made me realize that we need a few more words about
the translations. The reason is that we do distinct things for POSIX
ACLs and for fscaps. For POSIX ACLs when we call vfs_get_acl() what we
get is a struct posix_acl which contains k{g,u}id_t types. Because
struct posix_acl is cached filesytems wide and thus shared among
concurrent retrievers from different mounts with different idmappings.
Which means that we can't put vfs{g,u}id_t types in there. Instead we
perform translations on the fly. We do that in the VFS during path
lookup and we do that for overlayfs when it retrieves POSIX ACLs.

However, for fscaps we seem to do it differently because they're not
cached which is ok because they don't matter during path lookup as POSIX
ACLs do. So performance here doesn't matter too much. But that means
overall that the translations are quite distinct. And that gets
confusing when we have a stacking filesystem in the mix where we have to
take into account the privileges of the mounter of the overlayfs
instance and the idmap of the lower/upper layer.

I only skimmed my old commit but I think that commit 0c5fd887d2bb ("acl: move
idmapped mount fixup into vfs_{g,s}etxattr()") contains a detailed explanation
of this as I see:

> For POSIX ACLs we need to do something similar. However, in contrast to 
fscaps
> we cannot apply the fix directly to the kernel internal posix acl data
> structure as this would alter the cached values and would also require a 
rework
> of how we currently deal with POSIX ACLs in general which almost never 
take the
> filesystem idmapping into account (the noteable exception being FUSE but 
even
> there the implementation is special) and instead retrieve the raw values 
based
> on the initial idmapping.

Could you please add a diagram/explanation illustrating the translations for
fscaps in the general case and for stacking filesystems? It doesn't really
matter too much where you put it. Either add a section to
Documentation/filesystems/porting.rst or add a section to
Documentation/filesystems/idmapping.rst.

> + revert_creds(old_cred);
> + return err;
> +}
> +
> +int ovl_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> +const struct vfs_caps *caps, int setxattr_flags)
> +{
> + int err;
> + struct ovl_fs *ofs = OVL_FS(dentry->d_sb);
> + struct dentry *upperdentry = ovl_dentry_upper(dentry);
> + struct dentry *realdentry = upperdentry ?: ovl_dentry_lower(dentry);
> + const struct cred *old_cred;
> +
> + /*
> +  * If the fscaps are to be remove from a lower file, check that they
> +  * exist before copying up.
> +  */
> + if (!caps && !upperdentry) {
> + struct path realpath;
> + struct vfs_caps lower_caps;
> +
> +

Re: [PATCH v2 19/25] fs: add vfs_remove_fscaps()

2024-02-23 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:50PM -0600, Seth Forshee (DigitalOcean) wrote:
> Provide a type-safe interface for removing filesystem capabilities and a
> generic implementation suitable for most filesystems. Also add an
> internal interface, vfs_remove_fscaps_nosec(), which is called with the
> inode lock held and skips security checks for later use from the
> capability code.
> 
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---
>  fs/xattr.c | 81 
> ++
>  include/linux/fs.h |  2 ++
>  2 files changed, 83 insertions(+)
> 
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 96de43928a51..8b0f7384cbc9 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -324,6 +324,87 @@ int vfs_set_fscaps(struct mnt_idmap *idmap, struct 
> dentry *dentry,
>  }
>  EXPORT_SYMBOL(vfs_set_fscaps);
>  
> +static int generic_remove_fscaps(struct mnt_idmap *idmap, struct dentry 
> *dentry)
> +{
> + return __vfs_removexattr(idmap, dentry, XATTR_NAME_CAPS);
> +}
> +
> +/**
> + * vfs_remove_fscaps_nosec - remove filesystem capabilities without
> + *   security checks
> + * @idmap: idmap of the mount the inode was found from
> + * @dentry: the dentry from which to remove filesystem capabilities
> + *
> + * This function removes any filesystem capabilities from the specified
> + * dentry. Does not perform any security checks, and callers must hold the
> + * inode lock.
> + *
> + * Return: 0 on success, a negative errno on error.
> + */
> +int vfs_remove_fscaps_nosec(struct mnt_idmap *idmap, struct dentry *dentry)
> +{
> + struct inode *inode = dentry->d_inode;
> + int error;
> +
> + if (inode->i_op->set_fscaps)
> + error =  inode->i_op->set_fscaps(idmap, dentry, NULL,
> +  XATTR_REPLACE);
> + else
> + error = generic_remove_fscaps(idmap, dentry);
> +
> + return error;
> +}
> +
> +/**
> + * vfs_remove_fscaps - remove filesystem capabilities
> + * @idmap: idmap of the mount the inode was found from
> + * @dentry: the dentry from which to remove filesystem capabilities
> + *
> + * This function removes any filesystem capabilities from the specified
> + * dentry.
> + *
> + * Return: 0 on success, a negative errno on error.
> + */
> +int vfs_remove_fscaps(struct mnt_idmap *idmap, struct dentry *dentry)
> +{
> + struct inode *inode = dentry->d_inode;
> + struct inode *delegated_inode = NULL;
> + int error;
> +
> +retry_deleg:
> + inode_lock(inode);
> +
> + error = xattr_permission(idmap, inode, XATTR_NAME_CAPS, MAY_WRITE);
> + if (error)
> + goto out_inode_unlock;

Should also use may_write_xattr() instead of xattr_permission() if
possible.

> +
> + error = security_inode_remove_fscaps(idmap, dentry);
> + if (error)
> + goto out_inode_unlock;
> +
> + error = try_break_deleg(inode, _inode);
> + if (error)
> + goto out_inode_unlock;
> +
> + error = vfs_remove_fscaps_nosec(idmap, dentry);
> + if (!error) {
> + fsnotify_xattr(dentry);
> + evm_inode_post_remove_fscaps(dentry);
> + }
> +
> +out_inode_unlock:
> + inode_unlock(inode);
> +
> + if (delegated_inode) {
> + error = break_deleg_wait(_inode);
> + if (!error)
> + goto retry_deleg;
> + }
> +
> + return error;
> +}
> +EXPORT_SYMBOL(vfs_remove_fscaps);
> +
>  int
>  __vfs_setxattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  struct inode *inode, const char *name, const void *value,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4f5d7ed44644..c07427d2fc71 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2122,6 +2122,8 @@ extern int vfs_get_fscaps(struct mnt_idmap *idmap, 
> struct dentry *dentry,
> struct vfs_caps *caps);
>  extern int vfs_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> const struct vfs_caps *caps, int setxattr_flags);
> +extern int vfs_remove_fscaps_nosec(struct mnt_idmap *idmap, struct dentry 
> *dentry);
> +extern int vfs_remove_fscaps(struct mnt_idmap *idmap, struct dentry *dentry);

Please drop the extern.

Re: [PATCH v2 18/25] fs: add vfs_set_fscaps()

2024-02-23 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:49PM -0600, Seth Forshee (DigitalOcean) wrote:
> Provide a type-safe interface for setting filesystem capabilities and a
> generic implementation suitable for most filesystems.
> 
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---
>  fs/xattr.c | 79 
> ++
>  include/linux/fs.h |  2 ++
>  2 files changed, 81 insertions(+)
> 
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 10d1b1f78fc2..96de43928a51 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -245,6 +245,85 @@ int vfs_get_fscaps(struct mnt_idmap *idmap, struct 
> dentry *dentry,
>  }
>  EXPORT_SYMBOL(vfs_get_fscaps);
>  
> +static int generic_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> +   const struct vfs_caps *caps, int setxattr_flags)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct vfs_ns_cap_data nscaps;
> + int size;

ssize_t, I believe.

> +
> + size = vfs_caps_to_xattr(idmap, i_user_ns(inode), caps,
> +  , sizeof(nscaps));
> + if (size < 0)
> + return size;
> +
> + return __vfs_setxattr_noperm(idmap, dentry, XATTR_NAME_CAPS,
> +  , size, setxattr_flags);
> +}
> +
> +/**
> + * vfs_set_fscaps - set filesystem capabilities
> + * @idmap: idmap of the mount the inode was found from
> + * @dentry: the dentry on which to set filesystem capabilities
> + * @caps: the filesystem capabilities to be written
> + * @setxattr_flags: setxattr flags to use when writing the capabilities xattr
> + *
> + * This function writes the supplied filesystem capabilities to the dentry.
> + *
> + * Return: 0 on success, a negative errno on error.
> + */
> +int vfs_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> +const struct vfs_caps *caps, int setxattr_flags)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct inode *delegated_inode = NULL;
> + int error;
> +
> +retry_deleg:
> + inode_lock(inode);
> +
> + error = xattr_permission(idmap, inode, XATTR_NAME_CAPS, MAY_WRITE);
> + if (error)
> + goto out_inode_unlock;

I think this should be

/*
 * We only care about restrictions the inode struct itself places upon
 * us otherwise fscaps aren't subject to any VFS restrictions.
 */
error = may_write_xattr(idmap, inode);
if (error)
goto out_inode_unlock;

which is a 1:1 copy of what POSIX ACLs do?

> + error = security_inode_set_fscaps(idmap, dentry, caps, setxattr_flags);
> + if (error)
> + goto out_inode_unlock;
> +
> + error = try_break_deleg(inode, _inode);
> + if (error)
> + goto out_inode_unlock;
> +
> + if (inode->i_opflags & IOP_XATTR) {

Fwiw, I think that if we move fscaps off of xattr handlers completely
this can go away and we can simply rely on ->{g,s}et_fscaps() being
implemented. But again, that can be in a follow-up series.

> + if (inode->i_op->set_fscaps)
> + error = inode->i_op->set_fscaps(idmap, dentry, caps,
> + setxattr_flags);
> + else
> + error = generic_set_fscaps(idmap, dentry, caps,
> +setxattr_flags);
> + if (!error) {
> + fsnotify_xattr(dentry);
> + security_inode_post_set_fscaps(idmap, dentry, caps,
> +setxattr_flags);
> + }
> + } else if (unlikely(is_bad_inode(inode))) {
> + error = -EIO;
> + } else {
> + error = -EOPNOTSUPP;
> + }
> +
> +out_inode_unlock:
> + inode_unlock(inode);
> +
> + if (delegated_inode) {
> + error = break_deleg_wait(_inode);
> + if (!error)
> + goto retry_deleg;
> + }
> +
> + return error;
> +}
> +EXPORT_SYMBOL(vfs_set_fscaps);
> +
>  int
>  __vfs_setxattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  struct inode *inode, const char *name, const void *value,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index d7cd2467e1ea..4f5d7ed44644 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2120,6 +2120,8 @@ extern int vfs_get_fscaps_nosec(struct mnt_idmap 
> *idmap, struct dentry *dentry,
>   struct vfs_caps *caps);
>  extern int vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> struct vfs_caps *caps);
> +extern int vfs_set_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> +   const struct vfs_caps *caps, int setxattr_flags);

Please drop the extern.

Re: [PATCH v2 17/25] fs: add vfs_get_fscaps()

2024-02-23 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:48PM -0600, Seth Forshee (DigitalOcean) wrote:
> Provide a type-safe interface for retrieving filesystem capabilities and
> a generic implementation suitable for most filesystems. Also add an
> internal interface, vfs_get_fscaps_nosec(), which skips security checks
> for later use from the capability code.
> 
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---
>  fs/xattr.c | 64 
> ++
>  include/linux/fs.h |  4 
>  2 files changed, 68 insertions(+)
> 
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 06290e4ebc03..10d1b1f78fc2 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -181,6 +181,70 @@ xattr_supports_user_prefix(struct inode *inode)
>  }
>  EXPORT_SYMBOL(xattr_supports_user_prefix);
>  
> +static int generic_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> +   struct vfs_caps *caps)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct vfs_ns_cap_data nscaps;
> + int ret;
> +
> + ret = __vfs_getxattr(dentry, inode, XATTR_NAME_CAPS, , 
> sizeof(nscaps));
> +
> + if (ret >= 0)
> + ret = vfs_caps_from_xattr(idmap, i_user_ns(inode), caps, 
> , ret);
> +
> + return ret;
> +}
> +
> +/**
> + * vfs_get_fscaps_nosec - get filesystem capabilities without security checks
> + * @idmap: idmap of the mount the inode was found from
> + * @dentry: the dentry from which to get filesystem capabilities
> + * @caps: storage in which to return the filesystem capabilities
> + *
> + * This function gets the filesystem capabilities for the dentry and returns
> + * them in @caps. It does not perform security checks.
> + *
> + * Return: 0 on success, a negative errno on error.
> + */
> +int vfs_get_fscaps_nosec(struct mnt_idmap *idmap, struct dentry *dentry,
> +  struct vfs_caps *caps)
> +{
> + struct inode *inode = d_inode(dentry);
> +
> + if (inode->i_op->get_fscaps)
> + return inode->i_op->get_fscaps(idmap, dentry, caps);
> + return generic_get_fscaps(idmap, dentry, caps);
> +}
> +
> +/**
> + * vfs_get_fscaps - get filesystem capabilities
> + * @idmap: idmap of the mount the inode was found from
> + * @dentry: the dentry from which to get filesystem capabilities
> + * @caps: storage in which to return the filesystem capabilities
> + *
> + * This function gets the filesystem capabilities for the dentry and returns
> + * them in @caps.
> + *
> + * Return: 0 on success, a negative errno on error.
> + */
> +int vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> +struct vfs_caps *caps)
> +{
> + int error;
> +
> + /*
> +  * The VFS has no restrictions on reading security.* xattrs, so
> +  * xattr_permission() isn't needed. Only LSMs get a say.
> +  */
> + error = security_inode_get_fscaps(idmap, dentry);
> + if (error)
> + return error;
> +
> + return vfs_get_fscaps_nosec(idmap, dentry, caps);
> +}
> +EXPORT_SYMBOL(vfs_get_fscaps);
> +
>  int
>  __vfs_setxattr(struct mnt_idmap *idmap, struct dentry *dentry,
>  struct inode *inode, const char *name, const void *value,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 89163e0f7aad..d7cd2467e1ea 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2116,6 +2116,10 @@ extern int vfs_dedupe_file_range(struct file *file,
>  extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t 
> src_pos,
>   struct file *dst_file, loff_t dst_pos,
>   loff_t len, unsigned int remap_flags);
> +extern int vfs_get_fscaps_nosec(struct mnt_idmap *idmap, struct dentry 
> *dentry,
> +         struct vfs_caps *caps);
> +extern int vfs_get_fscaps(struct mnt_idmap *idmap, struct dentry *dentry,
> +   struct vfs_caps *caps);

Please drop the externs. Other than my usual complaing about this
falling back to the legacy vfs_*xattr() interfaces,
Reviewed-by: Christian Brauner

Re: [PATCH v2 16/25] fs: add inode operations to get/set/remove fscaps

2024-02-23 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:47PM -0600, Seth Forshee (DigitalOcean) wrote:
> Add inode operations for getting, setting and removing filesystem
> capabilities rather than passing around raw xattr data. This provides
> better type safety for ids contained within xattrs.
> 
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---

Looks good,
Reviewed-by: Christian Brauner

Re: [PATCH v2 11/25] security: add hooks for set/get/remove of fscaps

2024-02-23 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:42PM -0600, Seth Forshee (DigitalOcean) wrote:
> In preparation for moving fscaps out of the xattr code paths, add new
> security hooks. These hooks are largely needed because common kernel
> code will pass around struct vfs_caps pointers, which EVM will need to
> convert to raw xattr data for verification and updates of its hashes.
> 
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---

Looks good,
Reviewed-by: Christian Brauner

Re: [PATCH v2 10/25] xattr: use is_fscaps_xattr()

2024-02-23 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:41PM -0600, Seth Forshee (DigitalOcean) wrote:
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---

Looks good,
Reviewed-by: Christian Brauner

Re: [PATCH v2 09/25] commoncap: use is_fscaps_xattr()

2024-02-23 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:40PM -0600, Seth Forshee (DigitalOcean) wrote:
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---

Looks good,
Reviewed-by: Christian Brauner

Re: [PATCH v2 08/25] xattr: add is_fscaps_xattr() helper

2024-02-23 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:39PM -0600, Seth Forshee (DigitalOcean) wrote:
> Add a helper to determine if an xattr time is XATTR_NAME_CAPS instead of
> open-coding a string comparision.
> 
> Suggested-by: Amir Goldstein 
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---

Looks good,
Reviewed-by: Christian Brauner

Re: [PATCH v2 06/25] capability: provide helpers for converting between xattrs and vfs_caps

2024-02-23 Thread Christian Brauner

On Thu, Feb 22, 2024 at 09:38:04AM -0600, Seth Forshee (DigitalOcean) wrote:
> On Thu, Feb 22, 2024 at 04:20:08PM +0100, Christian Brauner wrote:
> > > + if ((magic_etc & VFS_CAP_REVISION_MASK) != VFS_CAP_REVISION_1) {
> > > + vfs_caps->permitted.val += 
> > > (u64)le32_to_cpu(caps->data[1].permitted) << 32;
> > > + vfs_caps->inheritable.val += 
> > > (u64)le32_to_cpu(caps->data[1].inheritable) << 32;
> > 
> > That + makes this even more difficult to read. This should be rewritten.
> 
> Do you meant that you would prefer |= to +=, or do you have something

Yes.

Re: [PATCH v2 00/25] fs: use type-safe uid representation for filesystem capabilities

2024-02-22 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:31PM -0600, Seth Forshee (DigitalOcean) wrote:
> This series converts filesystem capabilities from passing around raw
> xattr data to using a kernel-internal representation with type safe
> uids, similar to the conversion done previously for posix ACLs.
> Currently fscaps representations in the kernel have two different
> instances of unclear or confused types:
> 
> - fscaps are generally passed around in the raw xattr form, with the
>   rootid sometimes containing the user uid value and at other times
>   containing the filesystem value.
> - The existing kernel-internal representation of fscaps,
>   cpu_vfs_cap_data, uses the kuid_t type, but the value stored is
>   actually a vfsuid.
> 
> This series eliminates this confusion by converting the xattr data to
> the kernel representation near the userspace and filesystem boundaries,
> using the kernel representation within the vfs and commoncap code. The
> internal representation is renamed to vfs_caps to reflect this broader
> use, and the rootid is changed to a vfsuid_t to correctly identify the
> type of uid which it contains.
> 
> New vfs interfaces are added to allow for getting and setting fscaps
> using the kernel representation. This requires the addition of new inode
> operations to allow overlayfs to handle fscaps properly; all other
> filesystems fall back to a generic implementation. The top-level vfs
> xattr interfaces will now reject fscaps xattrs, though the lower-level
> interfaces continue to accept them for reading and writing the raw xattr
> data.
> 
> Based on previous feedback, new security hooks are added for fscaps
> operations. These are really only needed for EVM, and the selinux and
> smack implementations just peform the same operations that the
> equivalent xattr hooks would have done. Note too that this has not yet
> been updated based on the changes to make EVM into an LSM.
> 
> The remainder of the changes are preparatory work, addition of helpers
> for converting between the xattr and kernel fscaps representation, and
> various updates to use the kernel representation and new interfaces.

I still think that the generic_{get,set,remove}_fscaps() helpers falling
back to plain *vfs_*xattr() calls is a hackish. So ideally I'd like to
see this killed in a follow-up series and make all fses that support
them use the inode operation.

> 
> I have tested this code with xfstests, ltp, libcap2, and libcap-ng with
> no regressions found.

+1

Re: [PATCH v2 07/25] capability: provide a helper for converting vfs_caps to xattr for userspace

2024-02-22 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:38PM -0600, Seth Forshee (DigitalOcean) wrote:
> cap_inode_getsecurity() implements a handful of policies for capability
> xattrs read by userspace:
> 
>  - It returns EINVAL if the on-disk capability is in v1 format.
> 
>  - It masks off all bits in magic_etc except for the version and
>VFS_CAP_FLAGS_EFFECTIVE.
> 
>  - v3 capabilities are converted to v2 format if the rootid returned to
>userspace would be 0 or if the rootid corresponds to root in an
>ancestor user namespace.
> 
>  - It returns EOVERFLOW for a v3 capability whose rootid does not map to
>a valid id in current_user_ns() or to root in an ancestor namespace.
> 
> These policies must be maintained when converting vfs_caps to an xattr
> for userspace. Provide a vfs_caps_to_user_xattr() helper which will
> enforce these policies.
> 
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---

Looks good,
Reviewed-by: Christian Brauner

Re: [PATCH v2 06/25] capability: provide helpers for converting between xattrs and vfs_caps

2024-02-22 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:37PM -0600, Seth Forshee (DigitalOcean) wrote:
> To pass around vfs_caps instead of raw xattr data we will need to
> convert between the two representations near userspace and disk
> boundaries. We already convert xattrs from disks to vfs_caps, so move
> that code into a helper, and change get_vfs_caps_from_disk() to use the
> helper.
> 
> When converting vfs_caps to xattrs we have different considerations
> depending on the destination of the xattr data. For xattrs which will be
> written to disk we need to reject the xattr if the rootid does not map
> into the filesystem's user namespace, whereas xattrs read by userspace
> may need to undergo a conversion from v3 to v2 format when the rootid
> does not map. So this helper is split into an internal and an external
> interface. The internal interface does not return an error if the rootid
> has no mapping in the target user namespace and will be used for
> conversions targeting userspace. The external interface returns
> EOVERFLOW if the rootid has no mapping and will be used for all other
> conversions.
> 
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---
>  include/linux/capability.h |  10 ++
>  security/commoncap.c   | 228 
> +++--
>  2 files changed, 187 insertions(+), 51 deletions(-)
> 
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index eb46d346bbbc..a0893ac4664b 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -209,6 +209,16 @@ static inline bool checkpoint_restore_ns_capable(struct 
> user_namespace *ns)
>   ns_capable(ns, CAP_SYS_ADMIN);
>  }
>  
> +/* helpers to convert between xattr and in-kernel representations */
> +int vfs_caps_from_xattr(struct mnt_idmap *idmap,
> + struct user_namespace *src_userns,
> + struct vfs_caps *vfs_caps,
> + const void *data, size_t size);
> +ssize_t vfs_caps_to_xattr(struct mnt_idmap *idmap,
> +   struct user_namespace *dest_userns,
> +   const struct vfs_caps *vfs_caps,
> +   void *data, size_t size);
> +
>  /* audit system wants to get cap info from files as well */
>  int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
>  const struct dentry *dentry,
> diff --git a/security/commoncap.c b/security/commoncap.c
> index a0b5c9740759..7531c9634997 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -619,54 +619,41 @@ static inline int bprm_caps_from_vfs_caps(struct 
> vfs_caps *caps,
>  }
>  
>  /**
> - * get_vfs_caps_from_disk - retrieve vfs caps from disk
> + * vfs_caps_from_xattr - convert raw caps xattr data to vfs_caps
>   *
> - * @idmap:   idmap of the mount the inode was found from
> - * @dentry:  dentry from which @inode is retrieved
> - * @cpu_caps:vfs capabilities
> + * @idmap:  idmap of the mount the inode was found from
> + * @src_userns: user namespace for ids in xattr data
> + * @vfs_caps:   destination buffer for vfs_caps data
> + * @data:   rax xattr caps data
> + * @size:   size of xattr data
>   *
> - * Extract the on-exec-apply capability sets for an executable file.
> + * Converts a raw security.capability xattr into the kernel-internal
> + * capabilities format.
>   *
> - * If the inode has been found through an idmapped mount the idmap of
> - * the vfsmount must be passed through @idmap. This function will then
> - * take care to map the inode according to @idmap before checking
> - * permissions. On non-idmapped mounts or if permission checking is to be
> - * performed on the raw inode simply pass @nop_mnt_idmap.
> + * If the xattr is being read or written through an idmapped mount the
> + * idmap of the vfsmount must be passed through @idmap. This function
> + * will then take care to map the rootid according to @idmap.
> + *
> + * Return: On success, return 0; on error, return < 0.
>   */
> -int get_vfs_caps_from_disk(struct mnt_idmap *idmap,
> -const struct dentry *dentry,
> -struct vfs_caps *cpu_caps)
> +int vfs_caps_from_xattr(struct mnt_idmap *idmap,
> + struct user_namespace *src_userns,
> + struct vfs_caps *vfs_caps,
> + const void *data, size_t size)
>  {
> - struct inode *inode = d_backing_inode(dentry);
>   __u32 magic_etc;
> - int size;
> - struct vfs_ns_cap_data data, *nscaps = 
> - struct vfs_cap_data *caps = (struct vfs_cap_data *) 
> + const struct vfs_ns_cap_data *ns_caps = data;
> + struct vfs_cap_data *caps = (struct vfs_cap_data *)ns_caps;
>   kuid_t rootkuid;
> - vfsuid_t rootvfsuid;
> - struct user_namespace *fs_ns;
> -
> - memset(cpu_caps, 0, sizeof(struct vfs_caps));
> -
> - if (!inode)
> - return -ENODATA;
>  
> - fs_ns = inode->i_sb->s_user_ns;
> - size =

Re: [PATCH v2 05/25] capability: use vfsuid_t for vfs_caps rootids

2024-02-22 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:36PM -0600, Seth Forshee (DigitalOcean) wrote:
> The rootid is a kuid_t, but it contains an id which maped into a mount
> idmapping, so it is really a vfsuid. This is confusing and creates
> potential for misuse of the value, so change it to vfsuid_t.
> 
> Acked-by: Paul Moore 
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---

Looks good,
Reviewed-by: Christian Brauner

Re: [PATCH v2 03/25] capability: add static asserts for comapatibility of vfs_cap_data and vfs_ns_cap_data

2024-02-22 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:34PM -0600, Seth Forshee (DigitalOcean) wrote:
> Capability code depends on vfs_ns_cap_data being an extension of
> vfs_cap_data, so verify this at compile time.
> 
> Suggested-by: Christian Brauner 
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---

Looks good,
Reviewed-by: Christian Brauner

Re: [PATCH v2 02/25] mnt_idmapping: include cred.h

2024-02-22 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:33PM -0600, Seth Forshee (DigitalOcean) wrote:
> mnt_idmapping.h uses declarations from cred.h, so it should include that
> file directly.
> 
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---

Looks good,
Reviewed-by: Christian Brauner

Re: [PATCH v2 01/25] mnt_idmapping: split out core vfs[ug]id_t definitions into vfsid.h

2024-02-22 Thread Christian Brauner

On Wed, Feb 21, 2024 at 03:24:32PM -0600, Seth Forshee (DigitalOcean) wrote:
> The rootid member of cpu_vfs_cap_data is a kuid_t, but it should be a
> vfsuid_t as the id stored there is mapped into the mount idmapping. It's
> currently impossible to use vfsuid_t within cred.h though as it is
> defined in mnt_idmapping.h, which uses definitions from cred.h.
> 
> Split out the core vfsid type definitions into a separate file which can
> be included from cred.h.
> 
> Signed-off-by: Seth Forshee (DigitalOcean) 
> ---

Looks good,
Reviewed-by: Christian Brauner

Re: [PATCH 2/2] netfs: Fix missing zero-length check in unbuffered write

2024-02-20 Thread Christian Brauner

On Mon, Feb 19, 2024 at 09:38:33AM +0100, Linux regression tracking (Thorsten 
Leemhuis) wrote:
> On 29.01.24 10:49, David Howells wrote:
> > Fix netfs_unbuffered_write_iter() to return immediately if
> > generic_write_checks() returns 0, indicating there's nothing to write.
> > Note that netfs_file_write_iter() already does this.
> > 
> > Also, whilst we're at it, put in checks for the size being zero before we
> > even take the locks.  Note that generic_write_checks() can still reduce the
> > size to zero, so we still need that check.
> > 
> > Without this, a warning similar to the following is logged to dmesg:
> > 
> > netfs: Zero-sized write [R=1b6da]
> > 
> > and the syscall fails with EIO, e.g.:
> > 
> > /sbin/ldconfig.real: Writing of cache extension data failed: 
> > Input/output error
> > 
> > This can be reproduced on 9p by:
> > 
> > xfs_io -f -c 'pwrite 0 0' /xfstest.test/foo
> > 
> > Fixes: 153a9961b551 ("netfs: Implement unbuffered/DIO write support")
> > Reported-by: Eric Van Hensbergen 
> > Link: https://lore.kernel.org/r/ZbQUU6QKmIftKsmo@FV7GG9FTHL/
> 
> David, thx for fixing Eric's regression, which I'm tracking.
> 
> Christian, just wondering: that patch afaics is sitting in vfs.netfs for
> about three weeks now -- is that intentional or did it maybe fell
> through the cracks somehow?

I've moved it to vfs.fixes now and will send later this week. Thanks for
the reminder!

Re: [PATCH RESEND] cachefiles: fix memory leak in cachefiles_add_cache()

2024-02-20 Thread Christian Brauner

On Sat, 17 Feb 2024 16:14:31 +0800, Baokun Li wrote:
> The following memory leak was reported after unbinding /dev/cachefiles:
> 
> ==
> unreferenced object 0x9b674176e3c0 (size 192):
>   comm "cachefilesd2", pid 680, jiffies 4294881224
>   hex dump (first 32 bytes):
> 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>   backtrace (crc ea38a44b):
> [] kmem_cache_alloc+0x2d5/0x370
> [] prepare_creds+0x26/0x2e0
> [] cachefiles_determine_cache_security+0x1f/0x120
> [] cachefiles_add_cache+0x13c/0x3a0
> [] cachefiles_daemon_write+0x146/0x1c0
> [] vfs_write+0xcb/0x520
> [] ksys_write+0x69/0xf0
> [] do_syscall_64+0x72/0x140
> [] entry_SYSCALL_64_after_hwframe+0x6e/0x76
> ==
> 
> [...]

Sorry for the delay, David.

---

Applied to the vfs.fixes branch of the vfs/vfs.git tree.
Patches in the vfs.fixes branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.fixes

[1/1] cachefiles: fix memory leak in cachefiles_add_cache()
  https://git.kernel.org/vfs/vfs/c/e21a2f17566c

Re: Coverity: __do_sys_pidfd_send_signal(): UNINIT

2024-02-16 Thread Christian Brauner

On Wed, Feb 14, 2024 at 08:18:01PM +0100, Oleg Nesterov wrote:
> On 02/14, Tycho Andersen wrote:
> >
> > On Wed, Feb 14, 2024 at 06:55:55PM +0100, Oleg Nesterov wrote:
> > >
> > > We want to check the "flags" argument at the start, we do not want to
> > > delay the "case 0:" check until we have f.file (so that we can check
> > > f.file->f_flags).
> >
> > Fair point. I was thinking delaying it would make it simpler, but then
> > you have to free the file and it's less fast in the EINVAL case.
> 
> plus we do not want to return, say, -EBADF if the "flags" argument is wrong.
> 
> > I also don't have a strong opinion here.
> 
> Neither me.

Or you know, we just don't care about this. ;)
In any case since tis is a false positive it's not urgent in any way. If
either of you cares enough about this then please just send me patch that
reorders the checks to please that tool. The specific way doesn't matter
to me as well as long as we don't pointlessly fdget()/fdput().

Re: [f2fs-dev] [PATCH v5 04/12] fscrypt: Drop d_revalidate for valid dentries during lookup

2024-02-09 Thread Christian Brauner

On Fri, Feb 02, 2024 at 11:50:07AM -0300, Gabriel Krisman Bertazi wrote:
> Eric Biggers  writes:
> 
> > On Wed, Jan 31, 2024 at 03:35:40PM -0300, Gabriel Krisman Bertazi wrote:
> >> Eric Biggers  writes:
> >> 
> >> > On Mon, Jan 29, 2024 at 05:43:22PM -0300, Gabriel Krisman Bertazi wrote:
> >> >> Unencrypted and encrypted-dentries where the key is available don't need
> >> >> to be revalidated with regards to fscrypt, since they don't go stale
> >> >> from under VFS and the key cannot be removed for the encrypted case
> >> >> without evicting the dentry.  Mark them with d_set_always_valid, to
> >> >
> >> > "d_set_always_valid" doesn't appear in the diff itself.
> >> >
> >> >> diff --git a/include/linux/fscrypt.h b/include/linux/fscrypt.h
> >> >> index 4aaf847955c0..a22997b9f35c 100644
> >> >> --- a/include/linux/fscrypt.h
> >> >> +++ b/include/linux/fscrypt.h
> >> >> @@ -942,11 +942,22 @@ static inline int fscrypt_prepare_rename(struct 
> >> >> inode *old_dir,
> >> >>  static inline void fscrypt_prepare_lookup_dentry(struct dentry *dentry,
> >> >>  bool is_nokey_name)
> >> >>  {
> >> >> -   if (is_nokey_name) {
> >> >> -   spin_lock(>d_lock);
> >> >> +   spin_lock(>d_lock);
> >> >> +
> >> >> +   if (is_nokey_name)
> >> >> dentry->d_flags |= DCACHE_NOKEY_NAME;
> >> >> -   spin_unlock(>d_lock);
> >> >> +   else if (dentry->d_flags & DCACHE_OP_REVALIDATE &&
> >> >> +dentry->d_op->d_revalidate == fscrypt_d_revalidate) {
> >> >> +   /*
> >> >> +* Unencrypted dentries and encrypted dentries where the
> >> >> +* key is available are always valid from fscrypt
> >> >> +* perspective. Avoid the cost of calling
> >> >> +* fscrypt_d_revalidate unnecessarily.
> >> >> +*/
> >> >> +   dentry->d_flags &= ~DCACHE_OP_REVALIDATE;
> >> >> }
> >> >> +
> >> >> +   spin_unlock(>d_lock);
> >> >
> >> > This makes lookups in unencrypted directories start doing the
> >> > spin_lock/spin_unlock pair.  Is that really necessary?
> >> >
> >> > These changes also make the inline function fscrypt_prepare_lookup() 
> >> > very long
> >> > (when including the fscrypt_prepare_lookup_dentry() that's inlined into 
> >> > it).
> >> > The rule that I'm trying to follow is that to the extent that the 
> >> > fscrypt helper
> >> > functions are inlined, the inline part should be a fast path for 
> >> > unencrypted
> >> > directories.  Encrypted directories should be handled out-of-line.
> >> >
> >> > So looking at the original fscrypt_prepare_lookup():
> >> >
> >> >  static inline int fscrypt_prepare_lookup(struct inode *dir,
> >> >   struct dentry *dentry,
> >> >   struct fscrypt_name *fname)
> >> >  {
> >> >  if (IS_ENCRYPTED(dir))
> >> >  return __fscrypt_prepare_lookup(dir, dentry, fname);
> >> >
> >> >  memset(fname, 0, sizeof(*fname));
> >> >  fname->usr_fname = >d_name;
> >> >  fname->disk_name.name = (unsigned char *)dentry->d_name.name;
> >> >  fname->disk_name.len = dentry->d_name.len;
> >> >  return 0;
> >> >  }
> >> >
> >> > If you could just add the DCACHE_OP_REVALIDATE clearing for dentries in
> >> > unencrypted directories just before the "return 0;", hopefully without 
> >> > the
> >> > spinlock, that would be good.  Yes, that does mean that
> >> > __fscrypt_prepare_lookup() will have to handle it too, for the case of 
> >> > dentries
> >> > in encrypted directories, but that seems okay.
> >> 
> >> ok, will do.  IIUC, we might be able to do without the d_lock
> >> provided there is no store tearing.
> >> 
> >> But what was the reason you need the d_lock to set DCACHE_NOKEY_NAME
> >> during lookup?  Is there a race with parallel lookup setting d_flag that
> >> I couldn't find? Or is it another reason?
> >
> > d_flags is documented to be protected by d_lock.  So for setting
> > DCACHE_NOKEY_NAME, fs/crypto/ just does the safe thing of taking d_lock.  I
> > never really looked into whether the lock can be skipped there (i.e., 
> > whether
> > anything else can change d_flags while ->lookup is running), since this code
> > only ran for no-key names, for which performance isn't really important.
> 
> Yes, I was looking for the actual race that could happen here, and
> couldn't find one. As far as I understand it, the only thing that could
> see the dentry during a lookup would be a parallel lookup, but those
> will be held waiting for completion in d_alloc_parallel, and won't touch
> d_flags.  Currently, right after this code, we call d_set_d_op() in
> generic_set_encrypted_ci_d_ops(), which will happily write d_flags without
> the d_lock. If this is a problem here, we have a problem there.
> 
> What I really don't want to do is keep the lock for

Re: [PATCH v9 12/25] security: Introduce file_post_open hook

2024-02-09 Thread Christian Brauner

On Fri, Feb 09, 2024 at 11:46:16AM +0100, Roberto Sassu wrote:
> On Fri, 2024-02-09 at 11:12 +0100, Christian Brauner wrote:
> > On Mon, Jan 15, 2024 at 07:17:56PM +0100, Roberto Sassu wrote:
> > > From: Roberto Sassu 
> > > 
> > > In preparation to move IMA and EVM to the LSM infrastructure, introduce 
> > > the
> > > file_post_open hook. Also, export security_file_post_open() for NFS.
> > > 
> > > Based on policy, IMA calculates the digest of the file content and
> > > extends the TPM with the digest, verifies the file's integrity based on
> > > the digest, and/or includes the file digest in the audit log.
> > > 
> > > LSMs could similarly take action depending on the file content and the
> > > access mask requested with open().
> > > 
> > > The new hook returns a value and can cause the open to be aborted.
> > > 
> > > Signed-off-by: Roberto Sassu 
> > > Reviewed-by: Stefan Berger 
> > > Acked-by: Casey Schaufler 
> > > Reviewed-by: Mimi Zohar 
> > > ---
> > >  fs/namei.c|  2 ++
> > >  fs/nfsd/vfs.c |  6 ++
> > >  include/linux/lsm_hook_defs.h |  1 +
> > >  include/linux/security.h  |  6 ++
> > >  security/security.c   | 17 +
> > >  5 files changed, 32 insertions(+)
> > > 
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 71c13b2990b4..fb93d3e13df6 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -3620,6 +3620,8 @@ static int do_open(struct nameidata *nd,
> > >   error = may_open(idmap, >path, acc_mode, open_flag);
> > >   if (!error && !(file->f_mode & FMODE_OPENED))
> > >   error = vfs_open(>path, file);
> > > + if (!error)
> > > + error = security_file_post_open(file, op->acc_mode);
> > 
> > What does it do for O_CREAT? IOW, we managed to create that thing and we
> > managed to open that thing. Can security_file_post_open() and
> > ima_file_check() fail afterwards even for newly created files?
> 
> $ strace touch test-file
> ...
> openat(AT_FDCWD, "test-file", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 
> -1 EPERM (Operation not permitted)

Ah, meh. I was hoping IMA just wouldn't care about this case.

Re: [PATCH v9 10/25] security: Introduce inode_post_setattr hook

2024-02-09 Thread Christian Brauner

On Mon, Jan 15, 2024 at 07:17:54PM +0100, Roberto Sassu wrote:
> From: Roberto Sassu 
> 
> In preparation for moving IMA and EVM to the LSM infrastructure, introduce
> the inode_post_setattr hook.
> 
> At inode_setattr hook, EVM verifies the file's existing HMAC value. At
> inode_post_setattr, EVM re-calculates the file's HMAC based on the modified
> file attributes and other file metadata.
> 
> Other LSMs could similarly take some action after successful file attribute
> change.
> 
> The new hook cannot return an error and cannot cause the operation to be
> reverted.
> 
> Signed-off-by: Roberto Sassu 
> Reviewed-by: Stefan Berger 
> Reviewed-by: Mimi Zohar 
> Acked-by: Casey Schaufler 
> ---
>  fs/attr.c |  1 +

Acked-by: Christian Brauner

Re: [PATCH v9 11/25] security: Introduce inode_post_removexattr hook

2024-02-09 Thread Christian Brauner

On Mon, Jan 15, 2024 at 07:17:55PM +0100, Roberto Sassu wrote:
> From: Roberto Sassu 
> 
> In preparation for moving IMA and EVM to the LSM infrastructure, introduce
> the inode_post_removexattr hook.
> 
> At inode_removexattr hook, EVM verifies the file's existing HMAC value. At
> inode_post_removexattr, EVM re-calculates the file's HMAC with the passed
> xattr removed and other file metadata.
> 
> Other LSMs could similarly take some action after successful xattr removal.
> 
> The new hook cannot return an error and cannot cause the operation to be
> reverted.
> 
> Signed-off-by: Roberto Sassu 
> Reviewed-by: Stefan Berger 
> Reviewed-by: Mimi Zohar 
> Reviewed-by: Casey Schaufler 
> ---
>  fs/xattr.c    |  9 +

Acked-by: Christian Brauner

Re: [PATCH v9 13/25] security: Introduce file_release hook

2024-02-09 Thread Christian Brauner

On Mon, Jan 15, 2024 at 07:17:57PM +0100, Roberto Sassu wrote:
> From: Roberto Sassu 
> 
> In preparation for moving IMA and EVM to the LSM infrastructure, introduce
> the file_release hook.
> 
> IMA calculates at file close the new digest of the file content and writes
> it to security.ima, so that appraisal at next file access succeeds.
> 
> An LSM could implement an exclusive access scheme for files, only allowing
> access to files that have no references.
> 
> The new hook cannot return an error and cannot cause the operation to be
> reverted.
> 
> Signed-off-by: Roberto Sassu 
> ---
>  fs/file_table.c   |  1 +
>  include/linux/lsm_hook_defs.h |  1 +
>  include/linux/security.h  |  4 
>  security/security.c   | 11 +++
>  4 files changed, 17 insertions(+)
> 
> diff --git a/fs/file_table.c b/fs/file_table.c
> index de4a2915bfd4..c72dc75f2bd3 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -385,6 +385,7 @@ static void __fput(struct file *file)
>   eventpoll_release(file);
>   locks_remove_file(file);
>  
> + security_file_release(file);
>   ima_file_free(file);

This has always been an extremely dicy hook in here and that's caused us
issues before for stacking filesystems so I'm not enthusiastic about
exposing this to all LSMs. So reluctantly,

Acked-by: Christian Brauner

Re: [PATCH v9 12/25] security: Introduce file_post_open hook

2024-02-09 Thread Christian Brauner

On Mon, Jan 15, 2024 at 07:17:56PM +0100, Roberto Sassu wrote:
> From: Roberto Sassu 
> 
> In preparation to move IMA and EVM to the LSM infrastructure, introduce the
> file_post_open hook. Also, export security_file_post_open() for NFS.
> 
> Based on policy, IMA calculates the digest of the file content and
> extends the TPM with the digest, verifies the file's integrity based on
> the digest, and/or includes the file digest in the audit log.
> 
> LSMs could similarly take action depending on the file content and the
> access mask requested with open().
> 
> The new hook returns a value and can cause the open to be aborted.
> 
> Signed-off-by: Roberto Sassu 
> Reviewed-by: Stefan Berger 
> Acked-by: Casey Schaufler 
> Reviewed-by: Mimi Zohar 
> ---
>  fs/namei.c|  2 ++
>  fs/nfsd/vfs.c |  6 ++
>  include/linux/lsm_hook_defs.h |  1 +
>  include/linux/security.h  |  6 ++
>  security/security.c   | 17 +
>  5 files changed, 32 insertions(+)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 71c13b2990b4..fb93d3e13df6 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3620,6 +3620,8 @@ static int do_open(struct nameidata *nd,
>   error = may_open(idmap, >path, acc_mode, open_flag);
>   if (!error && !(file->f_mode & FMODE_OPENED))
>   error = vfs_open(>path, file);
> + if (!error)
> + error = security_file_post_open(file, op->acc_mode);

What does it do for O_CREAT? IOW, we managed to create that thing and we
managed to open that thing. Can security_file_post_open() and
ima_file_check() fail afterwards even for newly created files?

Re: [PATCH v9 12/25] security: Introduce file_post_open hook

2024-02-09 Thread Christian Brauner

On Fri, Feb 09, 2024 at 10:56:33AM +0100, Christian Brauner wrote:
> On Mon, Jan 15, 2024 at 07:17:56PM +0100, Roberto Sassu wrote:
> > From: Roberto Sassu 
> > 
> > In preparation to move IMA and EVM to the LSM infrastructure, introduce the
> > file_post_open hook. Also, export security_file_post_open() for NFS.
> > 
> > Based on policy, IMA calculates the digest of the file content and
> > extends the TPM with the digest, verifies the file's integrity based on
> > the digest, and/or includes the file digest in the audit log.
> > 
> > LSMs could similarly take action depending on the file content and the
> > access mask requested with open().
> > 
> > The new hook returns a value and can cause the open to be aborted.
> > 
> > Signed-off-by: Roberto Sassu 
> > Reviewed-by: Stefan Berger 
> > Acked-by: Casey Schaufler 
> > Reviewed-by: Mimi Zohar 
> > ---
> >  fs/namei.c|  2 ++
> 
> Acked-by: Christian Brauner 

Redacting that. I have a question.

Re: [PATCH v9 12/25] security: Introduce file_post_open hook

2024-02-09 Thread Christian Brauner

On Mon, Jan 15, 2024 at 07:17:56PM +0100, Roberto Sassu wrote:
> From: Roberto Sassu 
> 
> In preparation to move IMA and EVM to the LSM infrastructure, introduce the
> file_post_open hook. Also, export security_file_post_open() for NFS.
> 
> Based on policy, IMA calculates the digest of the file content and
> extends the TPM with the digest, verifies the file's integrity based on
> the digest, and/or includes the file digest in the audit log.
> 
> LSMs could similarly take action depending on the file content and the
> access mask requested with open().
> 
> The new hook returns a value and can cause the open to be aborted.
> 
> Signed-off-by: Roberto Sassu 
> Reviewed-by: Stefan Berger 
> Acked-by: Casey Schaufler 
> Reviewed-by: Mimi Zohar 
> ---
>  fs/namei.c|  2 ++

Acked-by: Christian Brauner

Re: [PATCH v9 14/25] security: Introduce path_post_mknod hook

2024-02-09 Thread Christian Brauner

On Mon, Jan 15, 2024 at 07:17:58PM +0100, Roberto Sassu wrote:
> From: Roberto Sassu 
> 
> In preparation for moving IMA and EVM to the LSM infrastructure, introduce
> the path_post_mknod hook.
> 
> IMA-appraisal requires all existing files in policy to have a file
> hash/signature stored in security.ima. An exception is made for empty files
> created by mknod, by tagging them as new files.
> 
> LSMs could also take some action after files are created.
> 
> The new hook cannot return an error and cannot cause the operation to be
> reverted.
> 
> Signed-off-by: Roberto Sassu 
> Acked-by: Casey Schaufler 
> Reviewed-by: Mimi Zohar 
> ---
>  fs/namei.c    |  5 +

Acked-by: Christian Brauner

Re: [PATCH v9 15/25] security: Introduce inode_post_create_tmpfile hook

2024-02-09 Thread Christian Brauner

On Mon, Jan 15, 2024 at 07:17:59PM +0100, Roberto Sassu wrote:
> From: Roberto Sassu 
> 
> In preparation for moving IMA and EVM to the LSM infrastructure, introduce
> the inode_post_create_tmpfile hook.
> 
> As temp files can be made persistent, treat new temp files like other new
> files, so that the file hash is calculated and stored in the security
> xattr.
> 
> LSMs could also take some action after temp files have been created.
> 
> The new hook cannot return an error and cannot cause the operation to be
> canceled.
> 
> Signed-off-by: Roberto Sassu 
> Acked-by: Casey Schaufler 
> Reviewed-by: Mimi Zohar 
> ---
>  fs/namei.c|  1 +

Acked-by: Christian Brauner

Re: [PATCH v9 17/25] security: Introduce inode_post_remove_acl hook

2024-02-09 Thread Christian Brauner

On Mon, Jan 15, 2024 at 07:18:01PM +0100, Roberto Sassu wrote:
> From: Roberto Sassu 
> 
> In preparation for moving IMA and EVM to the LSM infrastructure, introduce
> the inode_post_remove_acl hook.
> 
> At inode_remove_acl hook, EVM verifies the file's existing HMAC value. At
> inode_post_remove_acl, EVM re-calculates the file's HMAC with the passed
> POSIX ACL removed and other file metadata.
> 
> Other LSMs could similarly take some action after successful POSIX ACL
> removal.
> 
> The new hook cannot return an error and cannot cause the operation to be
> reverted.
> 
> Signed-off-by: Roberto Sassu 
> Reviewed-by: Stefan Berger 
> Acked-by: Casey Schaufler 
> Reviewed-by: Mimi Zohar 
> ---
>  fs/posix_acl.c|  1 +

Acked-by: Christian Brauner

Re: [PATCH v9 16/25] security: Introduce inode_post_set_acl hook

2024-02-09 Thread Christian Brauner

On Mon, Jan 15, 2024 at 07:18:00PM +0100, Roberto Sassu wrote:
> From: Roberto Sassu 
> 
> In preparation for moving IMA and EVM to the LSM infrastructure, introduce
> the inode_post_set_acl hook.
> 
> At inode_set_acl hook, EVM verifies the file's existing HMAC value. At
> inode_post_set_acl, EVM re-calculates the file's HMAC based on the modified
> POSIX ACL and other file metadata.
> 
> Other LSMs could similarly take some action after successful POSIX ACL
> change.
> 
> The new hook cannot return an error and cannot cause the operation to be
> reverted.
> 
> Signed-off-by: Roberto Sassu 
> Reviewed-by: Stefan Berger 
> Acked-by: Casey Schaufler 
> Reviewed-by: Mimi Zohar 
> ---
>  fs/posix_acl.c|  1 +

Acked-by: Christian Brauner

Re: [PATCH v9 20/25] ima: Move to LSM infrastructure

2024-02-09 Thread Christian Brauner

On Mon, Jan 15, 2024 at 07:18:04PM +0100, Roberto Sassu wrote:
> From: Roberto Sassu 
> 
> Move hardcoded IMA function calls (not appraisal-specific functions) from
> various places in the kernel to the LSM infrastructure, by introducing a
> new LSM named 'ima' (at the end of the LSM list and always enabled like
> 'integrity').
> 
> Having IMA before EVM in the Makefile is sufficient to preserve the
> relative order of the new 'ima' LSM in respect to the upcoming 'evm' LSM,
> and thus the order of IMA and EVM function calls as when they were
> hardcoded.
> 
> Make moved functions as static (except ima_post_key_create_or_update(),
> which is not in ima_main.c), and register them as implementation of the
> respective hooks in the new function init_ima_lsm().
> 
> Select CONFIG_SECURITY_PATH, to ensure that the path-based LSM hook
> path_post_mknod is always available and ima_post_path_mknod() is always
> executed to mark files as new, as before the move.
> 
> A slight difference is that IMA and EVM functions registered for the
> inode_post_setattr, inode_post_removexattr, path_post_mknod,
> inode_post_create_tmpfile, inode_post_set_acl and inode_post_remove_acl
> won't be executed for private inodes. Since those inodes are supposed to be
> fs-internal, they should not be of interest of IMA or EVM. The S_PRIVATE
> flag is used for anonymous inodes, hugetlbfs, reiserfs xattrs, XFS scrub
> and kernel-internal tmpfs files.
> 
> Conditionally register ima_post_key_create_or_update() if
> CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS is enabled. Also, conditionally register
> ima_kernel_module_request() if CONFIG_INTEGRITY_ASYMMETRIC_KEYS is enabled.
> 
> Finally, add the LSM_ID_IMA case in lsm_list_modules_test.c.
> 
> Signed-off-by: Roberto Sassu 
> Acked-by: Chuck Lever 
> ---
>  fs/file_table.c   |   2 -
>  fs/namei.c|   6 -
>  fs/nfsd/vfs.c     |   7 --
>  fs/open.c |   1 -

Acked-by: Christian Brauner

Re: [PATCH v9 22/25] evm: Move to LSM infrastructure

2024-02-09 Thread Christian Brauner

On Mon, Jan 15, 2024 at 07:18:06PM +0100, Roberto Sassu wrote:
> From: Roberto Sassu 
> 
> As for IMA, move hardcoded EVM function calls from various places in the
> kernel to the LSM infrastructure, by introducing a new LSM named 'evm'
> (last and always enabled like 'ima'). The order in the Makefile ensures
> that 'evm' hooks are executed after 'ima' ones.
> 
> Make EVM functions as static (except for evm_inode_init_security(), which
> is exported), and register them as hook implementations in init_evm_lsm().
> Also move the inline functions evm_inode_remove_acl(),
> evm_inode_post_remove_acl(), and evm_inode_post_set_acl() from the public
> evm.h header to evm_main.c.
> 
> Unlike before (see commit to move IMA to the LSM infrastructure),
> evm_inode_post_setattr(), evm_inode_post_set_acl(),
> evm_inode_post_remove_acl(), and evm_inode_post_removexattr() are not
> executed for private inodes.
> 
> Finally, add the LSM_ID_EVM case in lsm_list_modules_test.c
> 
> Signed-off-by: Roberto Sassu 
> Reviewed-by: Casey Schaufler 
> ---
>  fs/attr.c |   2 -
>  fs/posix_acl.c|   3 -
>  fs/xattr.c|   2 -

Acked-by: Christian Brauner

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 5019 matches

Mail list logo