[PATCH AUTOSEL 6.7 19/39] tracefs/eventfs: Use root and instance inodes as default ownership
(const char *name, umode_t mode, struct dentry *parent, void *data, const struct file_operations *fops) { + struct tracefs_inode *ti; struct dentry *dentry; struct inode *inode; @@ -621,7 +635,11 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, if (unlikely(!inode)) return tracefs_failed_creating(dentry); + ti = get_tracefs(inode); + ti->private = instance_inode(parent, inode); + inode->i_mode = mode; + inode->i_op = _file_inode_operations; inode->i_fop = fops ? fops : _file_operations; inode->i_private = data; inode->i_uid = d_inode(dentry->d_parent)->i_uid; @@ -634,6 +652,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, static struct dentry *__create_dir(const char *name, struct dentry *parent, const struct inode_operations *ops) { + struct tracefs_inode *ti; struct dentry *dentry = tracefs_start_creating(name, parent); struct inode *inode; @@ -651,6 +670,9 @@ static struct dentry *__create_dir(const char *name, struct dentry *parent, inode->i_uid = d_inode(dentry->d_parent)->i_uid; inode->i_gid = d_inode(dentry->d_parent)->i_gid; + ti = get_tracefs(inode); + ti->private = instance_inode(parent, inode); + /* directory inodes start off with i_nlink == 2 (for "." entry) */ inc_nlink(inode); d_instantiate(dentry, inode); @@ -681,7 +703,7 @@ struct dentry *tracefs_create_dir(const char *name, struct dentry *parent) if (security_locked_down(LOCKDOWN_TRACEFS)) return NULL; - return __create_dir(name, parent, _dir_inode_operations); + return __create_dir(name, parent, _dir_inode_operations); } /** @@ -712,7 +734,7 @@ __init struct dentry *tracefs_create_instance_dir(const char *name, if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir)) return NULL; - dentry = __create_dir(name, parent, _dir_inode_operations); + dentry = __create_dir(name, parent, _instance_dir_inode_operations); if (!dentry) return NULL; diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 42bdeb471a07..12b7d0150ae9 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -5,6 +5,9 @@ enum { TRACEFS_EVENT_INODE = BIT(1), TRACEFS_EVENT_TOP_INODE = BIT(2), + TRACEFS_GID_PERM_SET= BIT(3), + TRACEFS_UID_PERM_SET= BIT(4), + TRACEFS_INSTANCE_INODE = BIT(5), }; struct tracefs_inode { -- 2.43.0
Re: [RESEND] [PATCH] eventfs: Have inodes have unique inode numbers
On Fri, 26 Jan 2024 15:24:17 -0500 Mathieu Desnoyers wrote: > On 2024-01-26 15:12, Steven Rostedt wrote: > [...] > > diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c > > index e1b172c0e091..2187be6d7b23 100644 > > --- a/fs/tracefs/inode.c > > +++ b/fs/tracefs/inode.c > > @@ -223,13 +223,41 @@ static const struct inode_operations > > tracefs_file_inode_operations = { > > .setattr= tracefs_setattr, > > }; > > > > +/* Copied from get_next_ino() but adds allocation for multiple inodes */ > > +#define LAST_INO_BATCH 1024 > > +#define LAST_INO_MASK (~(LAST_INO_BATCH - 1)) > > +static DEFINE_PER_CPU(unsigned int, last_ino); > > + > > +unsigned int tracefs_get_next_ino(int files) > > +{ > > + unsigned int *p = _cpu_var(last_ino); > > + unsigned int res = *p; > > + > > +#ifdef CONFIG_SMP > > + /* Check if adding files+1 overflows */ > > How does it handle a @files input where: > > * (files+1 > LAST_INO_BATCH) ? > > * (files+1 == LAST_INO_BATCH) ? Well, this is moot anyway, as Linus hates it. -- Steve
Re: [RESEND] [PATCH] eventfs: Have inodes have unique inode numbers
On 2024-01-26 15:12, Steven Rostedt wrote: [...] diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index e1b172c0e091..2187be6d7b23 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -223,13 +223,41 @@ static const struct inode_operations tracefs_file_inode_operations = { .setattr= tracefs_setattr, }; +/* Copied from get_next_ino() but adds allocation for multiple inodes */ +#define LAST_INO_BATCH 1024 +#define LAST_INO_MASK (~(LAST_INO_BATCH - 1)) +static DEFINE_PER_CPU(unsigned int, last_ino); + +unsigned int tracefs_get_next_ino(int files) +{ + unsigned int *p = _cpu_var(last_ino); + unsigned int res = *p; + +#ifdef CONFIG_SMP + /* Check if adding files+1 overflows */ How does it handle a @files input where: * (files+1 > LAST_INO_BATCH) ? * (files+1 == LAST_INO_BATCH) ? + if (unlikely(!res || (res & LAST_INO_MASK) != ((res + files + 1) & LAST_INO_MASK))) { + static atomic_t shared_last_ino; + int next = atomic_add_return(LAST_INO_BATCH, _last_ino); + + res = next - LAST_INO_BATCH; + } +#endif + + res++; + /* get_next_ino should not provide a 0 inode number */ + if (unlikely(!res)) + res++; I suspect that bumping this res++ in the 0 case can cause inode range reservation issues at (files+1 == LAST_INO_BATCH-1). Thanks, Mathieu + *p = res + files; + put_cpu_var(last_ino); + return res; +} -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[RESEND] [PATCH] eventfs: Have inodes have unique inode numbers
From: "Steven Rostedt (Google)" Linus suggested to use the same inode numbers to make it easier to implement getdents(), as it was creating inodes just for generating a unique and consistent inode number. Linus suggested to just use the same inode for all files and directories. Later it was discovered that having directories with the same inode number would mess up the "find" command, but Linus found that on 64 bit machines, there was a hole in the eventfs_inode structure due to alignment that could be used to store the inode numbers for directories. That fixed the directory issue, but the files still had their own inode number. The 'tar' command uses inode numbers for determining uniqueness between files, which this would break. Currently, tar is broken with tracefs because all files show a stat of zero size and tar doesn't copy anything. But because tar cares about inode numbers, there could be other applications that do too. It's best to have all files have unique inode numbers. Copy the get_next_ino() to tracefs_get_next_ino() that takes a "files" parameter. As eventfs directories have a fixed number of files within them, the number of inodes needed for the eventfs directory files is known when the directory is created. The tracefs_get_next_ino() will return a new inode number but also reserve the next "files" inode numbers that the caller is free to use. Then when an inode for a file is created, its inode number will be its parent directory's inode number plus the index into the file array of that directory, giving each file a unique inode number that can be retrieved at any time. Signed-off-by: Steven Rostedt (Google) --- [ Resending because I sent the first one to the wrong mailing list. ] fs/tracefs/event_inode.c | 31 +++ fs/tracefs/inode.c | 37 ++--- fs/tracefs/internal.h| 1 + 3 files changed, 50 insertions(+), 19 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 6b211522a13e..7be7a694b106 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -32,14 +32,11 @@ */ static DEFINE_MUTEX(eventfs_mutex); -/* Choose something "unique" ;-) */ -#define EVENTFS_FILE_INODE_INO 0x12c4e37 - /* Just try to make something consistent and unique */ -static int eventfs_dir_ino(struct eventfs_inode *ei) +static int eventfs_dir_ino(struct eventfs_inode *ei, int nr_files) { if (!ei->ino) - ei->ino = get_next_ino(); + ei->ino = tracefs_get_next_ino(nr_files); return ei->ino; } @@ -327,6 +324,7 @@ void eventfs_update_gid(struct dentry *dentry, kgid_t gid) * @parent: parent dentry for this file. * @data: something that the caller will want to get to later on. * @fop: struct file_operations that should be used for this file. + * @ino: inode number for this file * * This function creates a dentry that represents a file in the eventsfs_inode * directory. The inode.i_private pointer will point to @data in the open() @@ -335,7 +333,8 @@ void eventfs_update_gid(struct dentry *dentry, kgid_t gid) static struct dentry *create_file(const char *name, umode_t mode, struct eventfs_attr *attr, struct dentry *parent, void *data, - const struct file_operations *fop) + const struct file_operations *fop, + unsigned int ino) { struct tracefs_inode *ti; struct dentry *dentry; @@ -363,9 +362,7 @@ static struct dentry *create_file(const char *name, umode_t mode, inode->i_op = _file_inode_operations; inode->i_fop = fop; inode->i_private = data; - - /* All files will have the same inode number */ - inode->i_ino = EVENTFS_FILE_INODE_INO; + inode->i_ino = ino; ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; @@ -377,12 +374,14 @@ static struct dentry *create_file(const char *name, umode_t mode, /** * create_dir - create a dir in the tracefs filesystem * @ei: the eventfs_inode that represents the directory to create - * @parent: parent dentry for this file. + * @parent: parent dentry for this directory. + * @nr_files: The number of files (not directories) this directory has * * This function will create a dentry for a directory represented by * a eventfs_inode. */ -static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent) +static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent, +int nr_files) { struct tracefs_inode *ti; struct dentry *dentry; @@ -404,7 +403,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent inode->i_fop = _file_operations;
Re: [PATCH v3 1/2] eventfs: Have the inodes all for files and directories all be the same
On Mon, Jan 22, 2024 at 02:02:28PM -0800, Linus Torvalds wrote: > On Mon, 22 Jan 2024 at 13:59, Darrick J. Wong wrote: > > > > though I don't think > > leaking raw kernel pointers is an awesome idea. > > Yeah, I wasn't all that comfortable even with trying to hash it > (because I think the number of source bits is small enough that even > with a crypto hash, it's trivially brute-forceable). > > See > >https://lore.kernel.org/all/20240122152748.46897...@gandalf.local.home/ > > for the current patch under discussion (and it contains a link _to_ > said discussion). Ah, cool, thank you! --D >Linus
Re: [PATCH v3 1/2] eventfs: Have the inodes all for files and directories all be the same
On Mon, 22 Jan 2024 at 13:59, Darrick J. Wong wrote: > > though I don't think > leaking raw kernel pointers is an awesome idea. Yeah, I wasn't all that comfortable even with trying to hash it (because I think the number of source bits is small enough that even with a crypto hash, it's trivially brute-forceable). See https://lore.kernel.org/all/20240122152748.46897...@gandalf.local.home/ for the current patch under discussion (and it contains a link _to_ said discussion). Linus
Re: [PATCH v3 1/2] eventfs: Have the inodes all for files and directories all be the same
On Tue, Jan 16, 2024 at 05:55:32PM -0500, Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > The dentries and inodes are created in the readdir for the sole purpose of > getting a consistent inode number. Linus stated that is unnecessary, and > that all inodes can have the same inode number. For a virtual file system > they are pretty meaningless. > > Instead use a single unique inode number for all files and one for all > directories. > > Link: https://lore.kernel.org/all/20240116133753.2808d...@gandalf.local.home/ > Link: > https://lore.kernel.org/linux-trace-kernel/20240116211353.412180...@goodmis.org > > Cc: Masami Hiramatsu > Cc: Mark Rutland > Cc: Mathieu Desnoyers > Cc: Christian Brauner > Cc: Al Viro > Cc: Ajay Kaher > Suggested-by: Linus Torvalds > Signed-off-by: Steven Rostedt (Google) > --- > fs/tracefs/event_inode.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index fdff53d5a1f8..5edf0b96758b 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -32,6 +32,10 @@ > */ > static DEFINE_MUTEX(eventfs_mutex); > > +/* Choose something "unique" ;-) */ > +#define EVENTFS_FILE_INODE_INO 0x12c4e37 > +#define EVENTFS_DIR_INODE_INO0x134b2f5 > + > /* > * The eventfs_inode (ei) itself is protected by SRCU. It is released from > * its parent's list and will have is_freed set (under eventfs_mutex). > @@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, > umode_t mode, > inode->i_fop = fop; > inode->i_private = data; > > + /* All files will have the same inode number */ > + inode->i_ino = EVENTFS_FILE_INODE_INO; > + > ti = get_tracefs(inode); > ti->flags |= TRACEFS_EVENT_INODE; > d_instantiate(dentry, inode); > @@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode > *ei, struct dentry *parent > inode->i_op = _root_dir_inode_operations; > inode->i_fop = _file_operations; > > + /* All directories will have the same inode number */ > + inode->i_ino = EVENTFS_DIR_INODE_INO; Regrettably, this leads to find failing on 6.8-rc1 (see xfs/55[89] in fstests): # find /sys/kernel/debug/tracing/ >/dev/null find: File system loop detected; ‘/sys/kernel/debug/tracing/events/initcall/initcall_finish’ is part of the same file system loop as ‘/sys/kernel/debug/tracing/events/initcall’. find: File system loop detected; ‘/sys/kernel/debug/tracing/events/initcall/initcall_start’ is part of the same file system loop as ‘/sys/kernel/debug/tracing/events/initcall’. find: File system loop detected; ‘/sys/kernel/debug/tracing/events/initcall/initcall_level’ is part of the same file system loop as ‘/sys/kernel/debug/tracing/events/initcall’. There were no such reports on 6.7.0; AFAICT find(1) is tripping over parent and child subdirectory having the same dev/i_ino. Changing this line to the following: /* All directories will NOT have the same inode number */ inode->i_ino = (unsigned long)inode; makes the messages about filesystem loops go away, though I don't think leaking raw kernel pointers is an awesome idea. --D > + > ti = get_tracefs(inode); > ti->flags |= TRACEFS_EVENT_INODE; > > -- > 2.43.0 > > >
Re: [PATCH] eventfs: Save directory inodes in the eventfs_inode structure
On Mon, Jan 22, 2024 at 03:27:48PM -0500, Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > The eventfs inodes and directories are allocated when referenced. But this > leaves the issue of keeping consistent inode numbers and the number is > only saved in the inode structure itself. When the inode is no longer > referenced, it can be freed. When the file that the inode was representing > is referenced again, the inode is once again created, but the inode number > needs to be the same as it was before. > > Just making the inode numbers the same for all files is fine, but that > does not work with directories. The find command will check for loops via > the inode number and having the same inode number for directories triggers: > > # find /sys/kernel/tracing > find: File system loop detected; > '/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the > same file system loop as > '/sys/kernel/debug/tracing/events/initcall'. > [..] > > Linus pointed out that the eventfs_inode structure ends with a single > 32bit int, and on 64 bit machines, there's likely a 4 byte hole due to > alignment. We can use this hole to store the inode number for the > eventfs_inode. All directories in eventfs are represented by an > eventfs_inode and that data structure can hold its inode number. > > That last int was also purposely placed at the end of the structure to > prevent holes from within. Now that there's a 4 byte number to hold the > inode, both the inode number and the last integer can be moved up in the > structure for better cache locality, where the llist and rcu fields can be > moved to the end as they are only used when the eventfs_inode is being > deleted. > > Link: > https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Ynrfb8=x+c7x0eewxn-yr...@mail.gmail.com/ > > Reported-by: Geert Uytterhoeven > Fixes: 53c41052ba31 ("eventfs: Have the inodes all for files and directories > all be the same") > Signed-off-by: Steven Rostedt (Google) Since I reviewed the earlier patch, I will repeat here for the formal one too. :) Thanks for avoiding the hashing! Reviewed-by: Kees Cook -- Kees Cook
[PATCH] eventfs: Save directory inodes in the eventfs_inode structure
From: "Steven Rostedt (Google)" The eventfs inodes and directories are allocated when referenced. But this leaves the issue of keeping consistent inode numbers and the number is only saved in the inode structure itself. When the inode is no longer referenced, it can be freed. When the file that the inode was representing is referenced again, the inode is once again created, but the inode number needs to be the same as it was before. Just making the inode numbers the same for all files is fine, but that does not work with directories. The find command will check for loops via the inode number and having the same inode number for directories triggers: # find /sys/kernel/tracing find: File system loop detected; '/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the same file system loop as '/sys/kernel/debug/tracing/events/initcall'. [..] Linus pointed out that the eventfs_inode structure ends with a single 32bit int, and on 64 bit machines, there's likely a 4 byte hole due to alignment. We can use this hole to store the inode number for the eventfs_inode. All directories in eventfs are represented by an eventfs_inode and that data structure can hold its inode number. That last int was also purposely placed at the end of the structure to prevent holes from within. Now that there's a 4 byte number to hold the inode, both the inode number and the last integer can be moved up in the structure for better cache locality, where the llist and rcu fields can be moved to the end as they are only used when the eventfs_inode is being deleted. Link: https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Ynrfb8=x+c7x0eewxn-yr...@mail.gmail.com/ Reported-by: Geert Uytterhoeven Fixes: 53c41052ba31 ("eventfs: Have the inodes all for files and directories all be the same") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 14 +++--- fs/tracefs/internal.h| 7 --- 2 files changed, 15 insertions(+), 6 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 6795fda2af19..6b211522a13e 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -34,7 +34,15 @@ static DEFINE_MUTEX(eventfs_mutex); /* Choose something "unique" ;-) */ #define EVENTFS_FILE_INODE_INO 0x12c4e37 -#define EVENTFS_DIR_INODE_INO 0x134b2f5 + +/* Just try to make something consistent and unique */ +static int eventfs_dir_ino(struct eventfs_inode *ei) +{ + if (!ei->ino) + ei->ino = get_next_ino(); + + return ei->ino; +} /* * The eventfs_inode (ei) itself is protected by SRCU. It is released from @@ -396,7 +404,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent inode->i_fop = _file_operations; /* All directories will have the same inode number */ - inode->i_ino = EVENTFS_DIR_INODE_INO; + inode->i_ino = eventfs_dir_ino(ei); ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; @@ -802,7 +810,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) name = ei_child->name; - ino = EVENTFS_DIR_INODE_INO; + ino = eventfs_dir_ino(ei_child); if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR)) goto out_dec; diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 12b7d0150ae9..45397df9bb65 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -55,6 +55,10 @@ struct eventfs_inode { struct eventfs_attr *entry_attrs; struct eventfs_attr attr; void*data; + unsigned intis_freed:1; + unsigned intis_events:1; + unsigned intnr_entries:30; + unsigned intino; /* * Union - used for deletion * @llist: for calling dput() if needed after RCU @@ -64,9 +68,6 @@ struct eventfs_inode { struct llist_node llist; struct rcu_head rcu; }; - unsigned intis_freed:1; - unsigned intis_events:1; - unsigned intnr_entries:30; }; static inline struct tracefs_inode *get_tracefs(const struct inode *inode) -- 2.43.0
[PATCH v3 2/2] eventfs: Do not create dentries nor inodes in iterate_shared
From: "Steven Rostedt (Google)" The original eventfs code added a wrapper around the dcache_readdir open callback and created all the dentries and inodes at open, and increment their ref count. A wrapper was added around the dcache_readdir release function to decrement all the ref counts of those created inodes and dentries. But this proved to be buggy[1] for when a kprobe was created during a dir read, it would create a dentry between the open and the release, and because the release would decrement all ref counts of all files and directories, that would include the kprobe directory that was not there to have its ref count incremented in open. This would cause the ref count to go to negative and later crash the kernel. To solve this, the dentries and inodes that were created and had their ref count upped in open needed to be saved. That list needed to be passed from the open to the release, so that the release would only decrement the ref counts of the entries that were incremented in the open. Unfortunately, the dcache_readdir logic was already using the file->private_data, which is the only field that can be used to pass information from the open to the release. What was done was the eventfs created another descriptor that had a void pointer to save the dcache_readdir pointer, and it wrapped all the callbacks, so that it could save the list of entries that had their ref counts incremented in the open, and pass it to the release. The wrapped callbacks would just put back the dcache_readdir pointer and call the functions it used so it could still use its data[2]. But Linus had an issue with the "hijacking" of the file->private_data (unfortunately this discussion was on a security list, so no public link). Which we finally agreed on doing everything within the iterate_shared callback and leave the dcache_readdir out of it[3]. All the information needed for the getents() could be created then. But this ended up being buggy too[4]. The iterate_shared callback was not the right place to create the dentries and inodes. Even Christian Brauner had issues with that[5]. An attempt was to go back to creating the inodes and dentries at the open, create an array to store the information in the file->private_data, and pass that information to the other callbacks.[6] The difference between that and the original method, is that it does not use dcache_readdir. It also does not up the ref counts of the dentries and pass them. Instead, it creates an array of a structure that saves the dentry's name and inode number. That information is used in the iterate_shared callback, and the array is freed in the dir release. The dentries and inodes created in the open are not used for the iterate_share or release callbacks. Just their names and inode numbers. Linus did not like that either[7] and just wanted to remove the dentries being created in iterate_shared and use the hard coded inode numbers. [ All this while Linus enjoyed an unexpected vacation during the merge window due to lack of power. ] [1] https://lore.kernel.org/linux-trace-kernel/20230919211804.230ed...@gandalf.local.home/ [2] https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d...@gandalf.local.home/ [3] https://lore.kernel.org/linux-trace-kernel/20240104015435.682218...@goodmis.org/ [4] https://lore.kernel.org/all/202401152142.bfc28861-oliver.s...@intel.com/ [5] https://lore.kernel.org/all/20240111-unzahl-gefegt-433acb8a841d@brauner/ [6] https://lore.kernel.org/all/20240116114711.7e863...@gandalf.local.home/ [7] https://lore.kernel.org/all/20240116170154.5bf0a...@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240116211353.573784...@goodmis.org Cc: Masami Hiramatsu Cc: Mark Rutland Cc: Mathieu Desnoyers Cc: Linus Torvalds Cc: Christian Brauner Cc: Al Viro Cc: Ajay Kaher Fixes: 493ec81a8fb8 ("eventfs: Stop using dcache_readdir() for getdents()") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202401152142.bfc28861-oliver.s...@intel.com Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 20 +--- 1 file changed, 5 insertions(+), 15 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 5edf0b96758b..10580d6b5012 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -727,8 +727,6 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) struct eventfs_inode *ei_child; struct tracefs_inode *ti; struct eventfs_inode *ei; - struct dentry *ei_dentry = NULL; - struct dentry *dentry; const char *name; umode_t mode; int idx; @@ -749,11 +747,11 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) mutex_lock(_mutex); ei = READ_ONCE(ti->private); - if (ei && !ei->is_freed) - ei_dentry = READ_ONCE(ei->dentry); +
[PATCH v3 0/2] eventfs: Create dentries and inodes at dir open
[ subject is still wrong, but is to match v2, see patch 2 for correct subject ] Changes since v2: https://lore.kernel.org/all/20240116211217.968123...@goodmis.org/ Implemented Linus's suggestion to just change the iterate_shared to use the hard coded inodes. Steven Rostedt (Google) (2): eventfs: Have the inodes all for files and directories all be the same eventfs: Do not create dentries nor inodes in iterate_shared fs/tracefs/event_inode.c | 30 +++--- 1 file changed, 15 insertions(+), 15 deletions(-)
[PATCH v3 1/2] eventfs: Have the inodes all for files and directories all be the same
From: "Steven Rostedt (Google)" The dentries and inodes are created in the readdir for the sole purpose of getting a consistent inode number. Linus stated that is unnecessary, and that all inodes can have the same inode number. For a virtual file system they are pretty meaningless. Instead use a single unique inode number for all files and one for all directories. Link: https://lore.kernel.org/all/20240116133753.2808d...@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20240116211353.412180...@goodmis.org Cc: Masami Hiramatsu Cc: Mark Rutland Cc: Mathieu Desnoyers Cc: Christian Brauner Cc: Al Viro Cc: Ajay Kaher Suggested-by: Linus Torvalds Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index fdff53d5a1f8..5edf0b96758b 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -32,6 +32,10 @@ */ static DEFINE_MUTEX(eventfs_mutex); +/* Choose something "unique" ;-) */ +#define EVENTFS_FILE_INODE_INO 0x12c4e37 +#define EVENTFS_DIR_INODE_INO 0x134b2f5 + /* * The eventfs_inode (ei) itself is protected by SRCU. It is released from * its parent's list and will have is_freed set (under eventfs_mutex). @@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t mode, inode->i_fop = fop; inode->i_private = data; + /* All files will have the same inode number */ + inode->i_ino = EVENTFS_FILE_INODE_INO; + ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; d_instantiate(dentry, inode); @@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent inode->i_op = _root_dir_inode_operations; inode->i_fop = _file_operations; + /* All directories will have the same inode number */ + inode->i_ino = EVENTFS_DIR_INODE_INO; + ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; -- 2.43.0
[PATCH v2 0/2] eventfs: Create dentries and inodes at dir open
[ subject is wrong, but is to match v1, see patch 2 for correct subject ] Fix reading dir again, this time without creating dentries and inodes. Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240116114711.7e863...@gandalf.local.home Steven Rostedt (Google) (2): eventfs: Have the inodes all for files and directories all be the same eventfs: Create list of files and directories at dir open fs/tracefs/event_inode.c | 223 +-- 1 file changed, 159 insertions(+), 64 deletions(-)
[PATCH v2 1/2] eventfs: Have the inodes all for files and directories all be the same
From: "Steven Rostedt (Google)" The dentries and inodes are created in the readdir for the sole purpose of getting a consistent inode number. Linus stated that is unnecessary, and that all inodes can have the same inode number. For a virtual file system they are pretty meaningless. Instead use a single unique inode number for all files and one for all directories. Link: https://lore.kernel.org/all/20240116133753.2808d...@gandalf.local.home/ Suggested-by: Linus Torvalds Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index fdff53d5a1f8..5edf0b96758b 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -32,6 +32,10 @@ */ static DEFINE_MUTEX(eventfs_mutex); +/* Choose something "unique" ;-) */ +#define EVENTFS_FILE_INODE_INO 0x12c4e37 +#define EVENTFS_DIR_INODE_INO 0x134b2f5 + /* * The eventfs_inode (ei) itself is protected by SRCU. It is released from * its parent's list and will have is_freed set (under eventfs_mutex). @@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t mode, inode->i_fop = fop; inode->i_private = data; + /* All files will have the same inode number */ + inode->i_ino = EVENTFS_FILE_INODE_INO; + ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; d_instantiate(dentry, inode); @@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent inode->i_op = _root_dir_inode_operations; inode->i_fop = _file_operations; + /* All directories will have the same inode number */ + inode->i_ino = EVENTFS_DIR_INODE_INO; + ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; -- 2.43.0
Re: [PATCH] eventfs: Create dentries and inodes at dir open
On Tue, 16 Jan 2024 10:53:36 -0800 Linus Torvalds wrote: > Let's at least *try* to move towards a better and simpler world, in other > words. OK, but I did just finish the below. I'll save it for another time if the single inode becomes an issue. I cropped it to just 31 bits, so I'm not sure how much that would still expose kernel addresses. But as you prefer a single inode number, I'll submit that instead, and store this as one of my many git branches that I hoard. This patch would give me: ~# ls -i /sys/kernel/tracing/events 2024739183 9p 801480813 iocost281269778 qdisc 401505785 alarmtimer 641794539 iomap1206040657 ras 1930290274 avc 1239754069 iommu 275816503 raw_syscalls 1108006611 block138722947 io_uring 1758073266 rcu 1528091656 bpf_test_run 694421747 ipi 1384703124 regmap 557364529 bpf_trace 2765888 irq30507018 regulator 1351362737 cfg80211 369905250 irq_matrix 2078862821 resctrl 886445085 cgroup 1115008967 irq_vectors 324090967 rpm 796209754 clk 1448778784 jbd2 1141318882 rseq 478739129 compaction99410253 kmem 1274783780 rtc 1714397712 cpuhp 2001779594 ksm 1409072807 sched 720701943 csd 51728677 kyber 347239934 scsi 1824588347 dev 1507974437 libata 1768671172 sd 1640041299 devfreq 1421146927 lock 1167562598 signal 121399632 dma_fence956825721 mac802114116590 skb 975908383 drm 738868061 maple_tree 1435501164 smbus 1227060804 e1000e_trace 969175760 mce 1664441095 sock 1770307058 enable 1225375766 mdio 1634697993 spi 1372107864 error_report 744198394 migrate 556841757 swiotlb 1356665351 exceptions 602669807 mmap 400337480 syscalls [..] ~# ls -i /sys/kernel/tracing/events/sched 1906992193 enable1210369853 sched_process_wait 592505447 filter1443020725 sched_stat_blocked 1742488280 sched_kthread_stop1213185672 sched_stat_iowait 1674576234 sched_kthread_stop_ret1908510325 sched_stat_runtime 1999376743 sched_kthread_work_execute_end1570203600 sched_stat_sleep 1041533096 sched_kthread_work_execute_start 608113040 sched_stat_wait 166824308 sched_kthread_work_queue_work 2033295833 sched_stick_numa 492462616 sched_migrate_task1617500395 sched_swap_numa 1786868196 sched_move_numa 1865216679 sched_switch 691687388 sched_pi_setprio 1465729401 sched_wait_task 1610977146 sched_process_exec 969941227 sched_wake_idle_without_ipi 610128037 sched_process_exit 84398030 sched_wakeup 2139284699 sched_process_fork 750220489 sched_wakeup_new 169172804 sched_process_free1024064201 sched_waking -- Steve diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index a0598eb3e26e..57448bf4acb4 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include #include @@ -32,6 +33,9 @@ */ static DEFINE_MUTEX(eventfs_mutex); +/* Used for making inode numbers */ +static siphash_key_t inode_key; + /* * The eventfs_inode (ei) itself is protected by SRCU. It is released from * its parent's list and will have is_freed set (under eventfs_mutex). @@ -57,6 +61,28 @@ static int eventfs_dir_open(struct inode *inode, struct file *file); static int eventfs_iterate(struct file *file, struct dir_context *ctx); static int eventfs_dir_release(struct inode *inode, struct file *file); +/* Copied from scripts/kconfig/symbol.c */ +static unsigned strhash(const char *s) +{ + /* fnv32 hash */ + unsigned hash = 2166136261U; + for (; *s; s++) + hash = (hash ^ *s) * 0x01000193; + return hash; +} + +/* Just try to make something consistent and unique */ +static int eventfs_inode_ino(struct eventfs_inode *ei, const char *name) +{ + unsigned long sip = (unsigned long)ei; + + sip += strhash(name); + sip = siphash_1u32((int)sip, _key); + + /* keep it positive */ + return sip & ((1U << 31) - 1); +} + static void update_attr(struct eventfs_attr *attr, struct iattr *iattr) { unsigned int ia_valid = iattr->ia_valid; @@ -491,6 +517,8 @@ create_file_dentry(struct eventfs_inode *ei, int idx, mutex_unlock(_mutex); dentry = create_file(name, mode, attr, parent, data, fops); + if (!IS_ERR_OR_NULL(dentry)) + dentry->d_inode->i_ino = eventfs_inode_ino(ei, name); mutex_lock(_mutex); @@ -585,6 +613,8 @@ create_dir_dentry(struct eventfs_inode *pei, struct eventfs_inode *ei, mutex_unlock(_mutex); dentry = create_dir(ei, parent); + if (!IS_ERR_OR_NULL(dentry)) +
Re: [PATCH] eventfs: Create dentries and inodes at dir open
On Tue, 16 Jan 2024 10:21:49 -0800 Linus Torvalds wrote: > Here's a clue: just fix your inode numbers. > > I can think of many ways to do it. Here's a couple: > > - use a fixed inode number for all inodes. It's fine. Really. You might > confuse some programs that still do getpwd() the legacy way, but hey, > nobody cares > > - just put the inode number in the same data structure everything else is > > > - make the inode number be a hash of the address of your data structure. > That's actually the closest to a traditional "real" inode number, which is > just an index to some on-disk thing > > I'm sure there are other *trivial* solutions. > > None of this is an excuse to misuse sentries. > > Try the first one - one single inode number - first. You shouldn't be doing > iget() anyway, so why do you care so deeply about a number that makes no > sense and nobody should care about? It was me being paranoid that using the same inode number would break user space. If that is not a concern, then I'm happy to just make it either the same, or maybe just hash the ei and name that it is associated with. If I do not fully understand how something is used, I try hard to make it act the same as it does for other use cases. That is, I did all this to keep inodes unique and consistent because I did not know if it would break something if I didn't. Removing that requirement does make it much easier to implement readdir. I think I'll do the hashing, just because I'm still paranoid that something might still break if they are all the same. -- Steve
Re: [PATCH] eventfs: Create dentries and inodes at dir open
On Tue, 16 Jan 2024 13:12:28 -0500 Steven Rostedt wrote: > Maybe I can just use a hash to generate he inode numbers from the name? > Hopefully there will be no collisions. Then I don't need the dentry > creation at all. Maybe I could use a hash of the address of the meta data to create the inode number? That may be unique enough. -- Steve
Re: [PATCH] eventfs: Create dentries and inodes at dir open
On Tue, 16 Jan 2024 09:55:15 -0800 Linus Torvalds wrote: > [ html crud because I still don't have power or real Internet, just trying > to keep an eye on things on my phone. Mailing lists removed to avoid > bounces, please put them back in replies that don't have my horrible > formatting ] > > No. > > Christ, you're making things worse again > > The reason for the bug is that you're still messing with the dentries at > readdir() time. I may have deleted the comment, but the only reason I created the inodes/destries is to keep the consistent inode number as it's created at the time the inodes and dentries are. Ah I did delete the comment, but it is still applicable :-/ /* -* Need to create the dentries and inodes to have a consistent -* inode number. +* Need to make a struct eventfs_dent array, start by +* allocating enough for just the files, which is a fixed +* array. Then use realloc via add_entry() for the directories +* which is stored in a linked list. */ So if for some reason, user space did a getdents() and used the inode numbers to match what it found, they would likely be the same. > > Just stop it. Readdir should not be creating dentries. Readdir should not > be even *looking* at dentries. You're not a virtual filesystem, the > dentries are just caches for filename lookup, and have *nothing* to do with > readdir. Actually, it's not looking at them. I did it as a way to just have the inode numbers be consistent. dents[cnt].ino = d->d_inode->i_ino; Yes, that's the only reason I create them. The dentry/inode is not used for anything outside of that. > > So just iterate over your own internal data structures in readdir. DO NOT > CREATE DENTRIES OR INODES FOR READDIR. > > I've told you before, and I'll tell you again: either you are a real and > proper virtual filesystem and you let the vfs layer manage *everything*, > and the dentries and inodes are all you have. Or you are a *real* > filesystem and you maintain your own data structures and the dentries and > inodes are just the in-memory caches. > > This "do both" is UNACCEPTABLE. > The dentries were created for the inode numbers so that I did not need to add them to meta data. They are generated at creation time. I don't know how important inode numbers are. If they can all just have random inode numbers and it doesn't break user space, where an inode number will be one value at one read and another shortly after, is that going to cause a problem? Maybe I can just use a hash to generate he inode numbers from the name? Hopefully there will be no collisions. Then I don't need the dentry creation at all. I do realize that if the dentries get freed due to reclaim and recreated, their inode numbers will be different. But for a virtual file system, I don't know how important having consistent inode numbers is. -- Steve
[PATCH] eventfs: Create dentries and inodes at dir open
From: "Steven Rostedt (Google)" The original eventfs code added a wrapper around the dcache_readdir open callback and created all the dentries and inodes at open, and increment their ref count. A wrapper was added around the dcache_readdir release function to decrement all the ref counts of those created inodes and dentries. But this proved to be buggy[1] for when a kprobe was created during a dir read, it would create a dentry between the open and the release, and because the release would decrement all ref counts of all files and directories, that would include the kprobe directory that was not there to have its ref count incremented in open. This would cause the ref count to go to negative and later crash the kernel. To solve this, the dentries and inodes that were created and had their ref count upped in open needed to be saved. That list needed to be passed from the open to the release, so that the release would only decrement the ref counts of the entries that were incremented in the open. Unfortunately, the dcache_readdir logic was already using the file->private_data, which is the only field that can be used to pass information from the open to the release. What was done was the eventfs created another descriptor that had a void pointer to save the dcache_readdir pointer, and it wrapped all the callbacks, so that it could save the list of entries that had their ref counts incremented in the open, and pass it to the release. The wrapped callbacks would just put back the dcache_readdir pointer and call the functions it used so it could still use its data[2]. But Linus had an issue with the "hijacking" of the file->private_data (unfortunately this discussion was on a security list, so no public link). Which we finally agreed on doing everything within the iterate_shared callback and leave the dcache_readdir out of it[3]. All the information needed for the getents() could be created then. But this ended up being buggy too[4]. The iterate_shared callback was not the right place to create the dentries and inodes. Even Christian Brauner had issues with that[5]. The real fix should be to go back to creating the inodes and dentries at the open, create an array to store the information in the file->private_data, and pass that information to the other callbacks. The difference between this and the original method, is that it does not use dcache_readdir. It also does not up the ref counts of the dentries and pass them. Instead, it creates an array of a structure that saves the dentry's name and inode number. That information is used in the iterate_shared callback, and the array is freed in the dir release. The dentries and inodes created in the open are not used for the iterate_share or release callbacks. Just their names and inode numbers. This means that the state of the eventfs at the dir open remains the same from the point of view of the getdents() function, until the dir is closed. This also means that the getdents() will not fail. If there's an issue, it fails at the dir open. [1] https://lore.kernel.org/linux-trace-kernel/20230919211804.230ed...@gandalf.local.home/ [2] https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d...@gandalf.local.home/ [3] https://lore.kernel.org/linux-trace-kernel/20240104015435.682218...@goodmis.org/ [4] https://lore.kernel.org/all/202401152142.bfc28861-oliver.s...@intel.com/ [5] https://lore.kernel.org/all/20240111-unzahl-gefegt-433acb8a841d@brauner/ Fixes: 493ec81a8fb8 ("eventfs: Stop using dcache_readdir() for getdents()") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202401152142.bfc28861-oliver.s...@intel.com Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 209 --- 1 file changed, 153 insertions(+), 56 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index fdff53d5a1f8..50b37f31d835 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -53,7 +53,9 @@ enum { static struct dentry *eventfs_root_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags); +static int eventfs_dir_open(struct inode *inode, struct file *file); static int eventfs_iterate(struct file *file, struct dir_context *ctx); +static int eventfs_dir_release(struct inode *inode, struct file *file); static void update_attr(struct eventfs_attr *attr, struct iattr *iattr) { @@ -212,7 +214,9 @@ static const struct inode_operations eventfs_file_inode_operations = { static const struct file_operations eventfs_file_operations = { .read = generic_read_dir, + .open = eventfs_dir_open, .iterate_shared = eventfs_iterate, + .release= eventfs_dir_release, .llseek = generic_file_llseek, }; @@ -706,34 +710,71 @@ static struct dentry *eventfs_root_lo
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Fri, 12 Jan 2024 08:53:44 -0500 Steven Rostedt wrote: > > // We managed to open the directory so we have permission to list > > // directory entries in "xfs". > > fd = open("/sys/kernel/tracing/events/xfs"); > > > > // Remove ownership so we can't open the directory anymore > > chown("/sys/kernel/tracing/events/xfs", 0, 0); > > > > // Or just remove exec bit for the group and restrict to owner > > chmod("/sys/kernel/tracing/events/xfs", 700); > > > > // Drop caches to force an eventfs_root_lookup() on everything > > write("/proc/sys/vm/drop_caches", "3", 1); > > This requires opening the directory, then having it's permissions > change, and then immediately dropping the caches. > > > > > // Returns 0 even though directory has a lot of entries and we should be > > // able to list them > > getdents64(fd, ...); > > And do we care? Hmm, maybe the issue you have is with the inconsistency of the action? For this to fail, it would require the admin to do both change the permission and to flush caches. If you don't flush the caches then the task with the dir open can still read it regardless. If the dentries were already created. In that case I'm fine if we change the creation of the dentries to not check the permission. But for now, it's just a weird side effect that I don't really see how it would affect any user's workflow. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Fri, 12 Jan 2024 09:27:24 +0100 Christian Brauner wrote: > On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote: > > On Thu, 11 Jan 2024 22:01:32 +0100 > > Christian Brauner wrote: > > > > > What I'm pointing out in the current logic is that the caller is > > > taxed twice: > > > > > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs") > > > (2) And again when you call lookup_one_len() in eventfs_start_creating() > > > _because_ the permission check in lookup_one_len() is the exact > > > same permission check again that the vfs has done > > > inode_permission(MAY_EXEC, "xfs"). > > > > As I described in: > > https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/ > > > > The eventfs files below "events" doesn't need the .permissions callback at > > all. It's only there because the "events" inode uses it. > > > > The .permissions call for eventfs has: > > It doesn't matter whether there's a ->permission handler. If you don't > add one explicitly the VFS will simply call generic_permission(): > > inode_permission() > -> do_inode_permission() >{ > if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) { > if (likely(inode->i_op->permission)) > return inode->i_op->permission(idmap, inode, mask); > > /* This gets set once for the inode lifetime */ > spin_lock(>i_lock); > inode->i_opflags |= IOP_FASTPERM; > spin_unlock(>i_lock); > } > return generic_permission(idmap, inode, mask); >} Yes I know that, because that's where I knew what to call in the non "events" dir case. > > > Anyway, the issue is with "events" directory and remounting, because like > > the tracefs system, the inode and dentry for "evnets" is created at boot > > up, before the mount happens. The VFS layer is going to check the > > permissions of its inode and dentry, which will be incorrect if the mount > > was mounted with a "gid" option. > > The gid option has nothing to do with this and it is just handled fine > if you remove the second permission checking in (2). I guess I'm confused to what you are having an issue with. Is it just that the permission check gets called twice? > > You need to remove the inode_permission() code from > eventfs_start_creating(). It is just an internal lookup and the fact > that you have it in there allows userspace to break readdir on the > eventfs portions of tracefs as I've shown in the parts of the mail that > you cut off. That's because I didn't see how it was related to the way I fixed the mount=gid issue. Are you only concerned because of the check in eventfs_start_creating()? Yes, you posted code that would make things act funny for some code that I see no real world use case for. Yeah, it may not act "properly" but I'm not sure that's bad. Here, I'll paste it back: > // We managed to open the directory so we have permission to list > // directory entries in "xfs". > fd = open("/sys/kernel/tracing/events/xfs"); > > // Remove ownership so we can't open the directory anymore > chown("/sys/kernel/tracing/events/xfs", 0, 0); > > // Or just remove exec bit for the group and restrict to owner > chmod("/sys/kernel/tracing/events/xfs", 700); > > // Drop caches to force an eventfs_root_lookup() on everything > write("/proc/sys/vm/drop_caches", "3", 1); This requires opening the directory, then having it's permissions change, and then immediately dropping the caches. > > // Returns 0 even though directory has a lot of entries and we should be > // able to list them > getdents64(fd, ...); And do we care? Since tracing exposes internal kernel information, perhaps this is a feature and not a bug. If someone who had access to the tracing system and you wanted to stop them, if they had a directory open that they no longer have access to, you don't want them to see what's left in the directory. In other words, I like the idea that the getends64(fd, ...) will fail! If there's a file underneath that wasn't change, and the admin thought that just keeping the top directory permissions off is good enough, then that attacker having that directory open before the directory had it's file permissions change is a way to still have access to the files below it. > > And the failure is in the inode_permission(MAY_EXEC, "xfs") call in > lookup_one_len() in eventfs_start_creating() which now fails. And I think is a good thing! Again, tracefs is special. It gives you access and possibly control to the kernel behavior. I like the fact that as soon as someone loses permission to a directory, they immediately lose it. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote: > On Thu, 11 Jan 2024 22:01:32 +0100 > Christian Brauner wrote: > > > What I'm pointing out in the current logic is that the caller is > > taxed twice: > > > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs") > > (2) And again when you call lookup_one_len() in eventfs_start_creating() > > _because_ the permission check in lookup_one_len() is the exact > > same permission check again that the vfs has done > > inode_permission(MAY_EXEC, "xfs"). > > As I described in: > https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/ > > The eventfs files below "events" doesn't need the .permissions callback at > all. It's only there because the "events" inode uses it. > > The .permissions call for eventfs has: It doesn't matter whether there's a ->permission handler. If you don't add one explicitly the VFS will simply call generic_permission(): inode_permission() -> do_inode_permission() { if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) { if (likely(inode->i_op->permission)) return inode->i_op->permission(idmap, inode, mask); /* This gets set once for the inode lifetime */ spin_lock(>i_lock); inode->i_opflags |= IOP_FASTPERM; spin_unlock(>i_lock); } return generic_permission(idmap, inode, mask); } > Anyway, the issue is with "events" directory and remounting, because like > the tracefs system, the inode and dentry for "evnets" is created at boot > up, before the mount happens. The VFS layer is going to check the > permissions of its inode and dentry, which will be incorrect if the mount > was mounted with a "gid" option. The gid option has nothing to do with this and it is just handled fine if you remove the second permission checking in (2). You need to remove the inode_permission() code from eventfs_start_creating(). It is just an internal lookup and the fact that you have it in there allows userspace to break readdir on the eventfs portions of tracefs as I've shown in the parts of the mail that you cut off.
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Thu, 11 Jan 2024 22:01:32 +0100 Christian Brauner wrote: > What I'm pointing out in the current logic is that the caller is > taxed twice: > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs") > (2) And again when you call lookup_one_len() in eventfs_start_creating() > _because_ the permission check in lookup_one_len() is the exact > same permission check again that the vfs has done > inode_permission(MAY_EXEC, "xfs"). As I described in: https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/ The eventfs files below "events" doesn't need the .permissions callback at all. It's only there because the "events" inode uses it. The .permissions call for eventfs has: static int eventfs_permission(struct mnt_idmap *idmap, struct inode *inode, int mask) { set_top_events_ownership(inode); return generic_permission(idmap, inode, mask); } Where the "set_top_events_ownership() is a nop for everything but the "events" directory. I guess I could have two ops: static const struct inode_operations eventfs_root_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, .permission = eventfs_permission, }; static const struct inode_operations eventfs_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, }; And use the second one for all dentries below the root, but I figured it's not that big of a deal if I called the permissions on all. Perhaps I should do it with two? Anyway, the issue is with "events" directory and remounting, because like the tracefs system, the inode and dentry for "evnets" is created at boot up, before the mount happens. The VFS layer is going to check the permissions of its inode and dentry, which will be incorrect if the mount was mounted with a "gid" option. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, Jan 10, 2024 at 08:07:46AM -0500, Steven Rostedt wrote: > On Wed, 10 Jan 2024 12:45:36 +0100 > Christian Brauner wrote: > > > So say you do: > > > > mkdir /sys/kernel/tracing/instances/foo > > > > After this has returned we know everything we need to know about the new > > tracefs instance including the ownership and the mode of all inodes in > > /sys/kernel/tracing/instances/foo/events/* and below precisely because > > ownership is always inherited from the parent dentry and recorded in the > > metadata struct eventfs_inode. > > > > So say someone does: > > > > open("/sys/kernel/tracing/instances/foo/events/xfs"); > > > > and say this is the first time that someone accesses that events/ > > directory. > > > > When the open pathwalk is done, the vfs will determine via > > > > [1] may_lookup(inode_of(events)) > > > > whether you are able to list entries such as "xfs" in that directory. > > The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds > > it ends up calling i_op->eventfs_root_lookup(events). > > > > At this point tracefs/eventfs adds the inodes for all entries in that > > "events" directory including "xfs" based on the metadata it recorded > > during the mkdir. Since now someone is actually interested in them. And > > it initializes the inodes with ownership and everything and adds the > > dentries that belong into that directory. > > > > Nothing here depends on the permissions of the caller. The only > > permission that mattered was done in the VFS in [1]. If the caller has > > permissions to enter a directory they can lookup and list its contents. > > And its contents where determined/fixed etc when mkdir was called. > > > > So we just need to add the required objects into the caches (inode, > > dentry) whose addition we intentionally defered until someone actually > > needed them. > > > > So, eventfs_root_lookup() now initializes the inodes with the ownership > > from the stored metadata or from the parent dentry and splices in inodes > > and dentries. No permission checking is needed for this because it is > > always a recheck of what the vfs did in [1]. > > > > We now return to the vfs and path walk continues to the final component > > that you actually want to open which is that "xfs" directory in this > > example. We check the permissions on that inode via may_open("xfs") and > > we open that directory returning an fd to userspace ultimately. > > > > (I'm going by memory since I need to step out the door.) > > So, let's say we do: > > chgrp -R rostedt /sys/kernel/tracing/ The rostedt group needs exec permissions and "other" cannot have exec permissions otherwise you can trivially list the entries even if it's owned by root: chmod 750 /sys/kernel/tracing user1@localhost:~$ ls -aln /sys/kernel/ | grep tracing drwxr-x--- 6 0 10000 Jan 11 18:23 tracing > > But I don't want rostedt to have access to xfs > > chgrp -R root /sys/kernel/tracing/events/xfs chmod 750 /sys/kernel/tracing/events/xfs user1@localhost:~$ ls -aln /sys/kernel/tracing/events/ | grep xfs drwxr-x--- 601 0 0 0 Jan 11 18:24 xfs This ensure that if a user is in the group and the group has exec perms lookup is possible (For root this will usually work because CAP_DAC_READ_SEARCH overrides the exec requirement.). > > Both actions will create the inodes and dentries of all files and > directories (because of "-R"). But once that is done, the ref counts go to > zero. They stay around until reclaim. But then I open Chrome ;-) and it > reclaims all the dentries and inodes, so we are back to here we were on > boot. > > Now as rostedt I do: > > ls /sys/kernel/tracing/events/xfs > > The VFS layer doesn't know if I have permission to that or not, because all > the inodes and dentries have been freed. It has to call back to eventfs to > find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will > recreated the inodes with the proper permission. Very roughly, ignoring most of the complexity of lookup and focussing on the permission checking: When a caller looks up an entry in a directory then the VFS will call inode_permission(MAY_EXEC) on the directory the caller is trying to perform that lookup in. If the caller wants to lookup the "events" entry in the "tracing" directory then the VFS will call inode_permission(MAY_EXEC, "tracing") and then - assuming it's not in the cache - call into the lookup method of the filesystem. After the VFS has determin
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, 10 Jan 2024 10:52:51 -0500 Steven Rostedt wrote: > On Wed, 10 Jan 2024 08:07:46 -0500 > Steven Rostedt wrote: > > > Or are you saying that I don't need the ".permission" callback, because > > eventfs does it when it creates the inodes? But for eventfs to know what > > the permissions changes are, it uses .getattr and .setattr. > > OK, if your main argument is that we do not need .permission, I agree with > you. But that's a trivial change and doesn't affect the complexity that > eventfs is doing. In fact, removing the "permission" check is simply this > patch: > > -- > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index fdff53d5a1f8..f2af07a857e2 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap, > return 0; > } > > -static int eventfs_permission(struct mnt_idmap *idmap, > - struct inode *inode, int mask) > -{ > - set_top_events_ownership(inode); > - return generic_permission(idmap, inode, mask); > -} > - > static const struct inode_operations eventfs_root_dir_inode_operations = { > .lookup = eventfs_root_lookup, > .setattr= eventfs_set_attr, > .getattr= eventfs_get_attr, > - .permission = eventfs_permission, > }; > > static const struct inode_operations eventfs_file_inode_operations = { > -- > > I only did that because Linus mentioned it, and I thought it was needed. > I'll apply this patch too, as it appears to work with this code. Oh, eventfs files and directories don't need the .permissions because its inodes and dentries are not created until accessed. But the "events" directory itself has its dentry and inode created at boot up, but still uses the eventfs_root_dir_inode_operations. So the .permissions is still needed! If you look at the "set_top_events_ownership()" function, it has: /* The top events directory doesn't get automatically updated */ if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) return; That is, it does nothing if the entry is not the "events" directory. It falls back to he default "->permissions()" function for everything but the top level "events" directory. But this and .getattr are still needed for the events directory, because it suffers the same issue as the other tracefs entries. That is, it's inodes and dentries are created at boot up before it is mounted. So if the mount has gid=1000, it will be ignored. The .getattr is called by "stat" which ls does. So after boot up if you just do: # chmod 0750 /sys/kernel/events # chmod 0770 /sys/kernel/tracing # mount -o remount,gid=1000 /sys/kernel/tracing # su - rostedt $ id uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt) $ ls /sys/kernel/tracing/events/ 9pext4iomapmodule raw_syscalls thermal alarmtimerfib iommumsr rcu thp avc fib6io_uring napiregmaptimer block filelockipi neigh regulator tlb bpf_test_run filemap irq net resctrl udp bpf_trace ftrace irq_matrix netfs rpm virtio_gpu[ ...] The above works because "ls" does a stat() on the directory first, which does a .getattr() call that updates the permissions of the existing "events" directory inode. BUT! If I had used my own getents() program that has: fd = openat(AT_FDCWD, argv[1], O_RDONLY); if (fd < 0) perror("openat"); n = getdents64(fd, buf, BUF_SIZE); if (n < 0) perror("getdents64"); Where it calls the openat() without doing a stat fist, and after boot, had done: # chmod 0750 /sys/kernel/events # chmod 0770 /sys/kernel/tracing # mount -o remount,gid=1000 /sys/kernel/tracing # su - rostedt $ id uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt) $ ./getdents /sys/kernel/tracing/events openat: Permission denied getdents64: Bad file descriptor It errors because he "events" inode permission hasn't been updated yet. Now after getting the above error, if I do the "ls" and then run it again: $ ls /sys/kernel/tracing/events > /dev/null $ ./getdents /sys/kernel/tracing/events enable header_page header_event initcall vsyscall syscalls it works! so no, I can't remove that .permissions callback from eventfs. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, 10 Jan 2024 10:52:51 -0500 Steven Rostedt wrote: > I'll apply this patch too, as it appears to work with this code. I meant "appears to work without this code". -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, 10 Jan 2024 08:07:46 -0500 Steven Rostedt wrote: > Or are you saying that I don't need the ".permission" callback, because > eventfs does it when it creates the inodes? But for eventfs to know what > the permissions changes are, it uses .getattr and .setattr. OK, if your main argument is that we do not need .permission, I agree with you. But that's a trivial change and doesn't affect the complexity that eventfs is doing. In fact, removing the "permission" check is simply this patch: -- diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index fdff53d5a1f8..f2af07a857e2 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap, return 0; } -static int eventfs_permission(struct mnt_idmap *idmap, - struct inode *inode, int mask) -{ - set_top_events_ownership(inode); - return generic_permission(idmap, inode, mask); -} - static const struct inode_operations eventfs_root_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, - .permission = eventfs_permission, }; static const struct inode_operations eventfs_file_inode_operations = { -- I only did that because Linus mentioned it, and I thought it was needed. I'll apply this patch too, as it appears to work with this code. Thanks! -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, 10 Jan 2024 12:45:36 +0100 Christian Brauner wrote: > So say you do: > > mkdir /sys/kernel/tracing/instances/foo > > After this has returned we know everything we need to know about the new > tracefs instance including the ownership and the mode of all inodes in > /sys/kernel/tracing/instances/foo/events/* and below precisely because > ownership is always inherited from the parent dentry and recorded in the > metadata struct eventfs_inode. > > So say someone does: > > open("/sys/kernel/tracing/instances/foo/events/xfs"); > > and say this is the first time that someone accesses that events/ > directory. > > When the open pathwalk is done, the vfs will determine via > > [1] may_lookup(inode_of(events)) > > whether you are able to list entries such as "xfs" in that directory. > The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds > it ends up calling i_op->eventfs_root_lookup(events). > > At this point tracefs/eventfs adds the inodes for all entries in that > "events" directory including "xfs" based on the metadata it recorded > during the mkdir. Since now someone is actually interested in them. And > it initializes the inodes with ownership and everything and adds the > dentries that belong into that directory. > > Nothing here depends on the permissions of the caller. The only > permission that mattered was done in the VFS in [1]. If the caller has > permissions to enter a directory they can lookup and list its contents. > And its contents where determined/fixed etc when mkdir was called. > > So we just need to add the required objects into the caches (inode, > dentry) whose addition we intentionally defered until someone actually > needed them. > > So, eventfs_root_lookup() now initializes the inodes with the ownership > from the stored metadata or from the parent dentry and splices in inodes > and dentries. No permission checking is needed for this because it is > always a recheck of what the vfs did in [1]. > > We now return to the vfs and path walk continues to the final component > that you actually want to open which is that "xfs" directory in this > example. We check the permissions on that inode via may_open("xfs") and > we open that directory returning an fd to userspace ultimately. > > (I'm going by memory since I need to step out the door.) So, let's say we do: chgrp -R rostedt /sys/kernel/tracing/ But I don't want rostedt to have access to xfs chgrp -R root /sys/kernel/tracing/events/xfs Both actions will create the inodes and dentries of all files and directories (because of "-R"). But once that is done, the ref counts go to zero. They stay around until reclaim. But then I open Chrome ;-) and it reclaims all the dentries and inodes, so we are back to here we were on boot. Now as rostedt I do: ls /sys/kernel/tracing/events/xfs The VFS layer doesn't know if I have permission to that or not, because all the inodes and dentries have been freed. It has to call back to eventfs to find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will recreated the inodes with the proper permission. Or are you saying that I don't need the ".permission" callback, because eventfs does it when it creates the inodes? But for eventfs to know what the permissions changes are, it uses .getattr and .setattr. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Mon, Jan 08, 2024 at 10:23:31AM -0500, Steven Rostedt wrote: > On Mon, 8 Jan 2024 12:04:54 +0100 > Christian Brauner wrote: > > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > > > redundant and just wrong. > > > > > > I don't think so. > > > > I'm very well aware that the dentries and inode aren't created during > > mkdir but the completely directory layout is determined. You're just > > splicing in dentries and inodes during lookup and readdir. > > > > If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later > > do a lookup/readdir on > > > > ls -al /sys/kernel/tracing/instances/foo/events > > > > Why should the creation of the dentries and inodes ever fail due to a > > permission failure? > > They shouldn't. > > > The vfs did already verify that you had the required > > permissions to list entries in that directory. Why should filling up > > /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't > > That tracefs instance would be half-functional. And again, right now > > that inode_permission() check cannot even fail. > > And it shouldn't. But without dentries and inodes, how does VFS know what > is allowed to open the files? So say you do: mkdir /sys/kernel/tracing/instances/foo After this has returned we know everything we need to know about the new tracefs instance including the ownership and the mode of all inodes in /sys/kernel/tracing/instances/foo/events/* and below precisely because ownership is always inherited from the parent dentry and recorded in the metadata struct eventfs_inode. So say someone does: open("/sys/kernel/tracing/instances/foo/events/xfs"); and say this is the first time that someone accesses that events/ directory. When the open pathwalk is done, the vfs will determine via [1] may_lookup(inode_of(events)) whether you are able to list entries such as "xfs" in that directory. The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds it ends up calling i_op->eventfs_root_lookup(events). At this point tracefs/eventfs adds the inodes for all entries in that "events" directory including "xfs" based on the metadata it recorded during the mkdir. Since now someone is actually interested in them. And it initializes the inodes with ownership and everything and adds the dentries that belong into that directory. Nothing here depends on the permissions of the caller. The only permission that mattered was done in the VFS in [1]. If the caller has permissions to enter a directory they can lookup and list its contents. And its contents where determined/fixed etc when mkdir was called. So we just need to add the required objects into the caches (inode, dentry) whose addition we intentionally defered until someone actually needed them. So, eventfs_root_lookup() now initializes the inodes with the ownership from the stored metadata or from the parent dentry and splices in inodes and dentries. No permission checking is needed for this because it is always a recheck of what the vfs did in [1]. We now return to the vfs and path walk continues to the final component that you actually want to open which is that "xfs" directory in this example. We check the permissions on that inode via may_open("xfs") and we open that directory returning an fd to userspace ultimately. (I'm going by memory since I need to step out the door.)
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Mon, 8 Jan 2024 12:32:46 +0100 Christian Brauner wrote: > On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote: > > On Sun, 7 Jan 2024 13:29:12 -0500 > > Steven Rostedt wrote: > > > > > > > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > > > redundant and just wrong. > > > > > > I don't think so. > > > > Just to make it clear. eventfs has nothing to do with mkdir instance/foo. > > It exists without that. Although one rationale to do eventfs was so > > Every instance/foo/ tracefs instances also contains an events directory > and thus a eventfs portion. Eventfs is just a subtree of tracefs. It's > not a separate filesystem. Both eventfs and tracefs are on the same > single, system wide superblock. > > > that the instance directories wouldn't recreate the same 10thousands > > event inodes and dentries for every mkdir done. > > I know but that's irrelevant to what I'm trying to tell you. > > A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs > instance. With or without the on-demand dentry and inode creation for > the eventfs portion that tracefs "instance" has now been created in its > entirety including all the required information for someone to later > come along and perform a lookup on /sys/kernel/tracing/instances/foo/events. > > All you've done is to defer the addition of the dentries and inodes when > someone does actually look at the events directory of the tracefs > instance. > > Whether you choose to splice in the dentries and inodes for the eventfs > portion during lookup and readdir or if you had chosen to not do the > on-demand thing at all and the entries were created at the same time as > the mkdir call are equivalent from the perspective of permission > checking. > > If you have the required permissions to look at the events directory > then there's no reason why listing the directory entries in there should > fail. This can't even happen right now. Ah, I think I know where the confusion lies. The tracing information in kernel/trace/*.c doesn't keep track of permission. It relies totally on fs/tracefs/* to do so. If someone does 'chmod' or 'chown' or mounts with 'gid=xxx' then it's up to tracefs to maintain that information and not the tracing subsystem. The tracing subsystem only gives the "default" permissions (before boot finishes). The difference between normal file systems and pseudo file systems like debugfs and tracefs, is that normal file systems keep the permission information stored on the external device. That is, when the inodes and dentries are created, the information is retrieved from the stored file system. I think this may actually be a failure of debugfs (and tracefs as it was based on debugfs), in that the inodes and dentries are created at the same time the "files" backing them are. Which is normally at boot up and before the file system is mounted. That is, inodes and dentries are actually coupled with the data they represent. It's not a cache for a back store like a hard drive partition. To create a file in debugfs you do: struct dentry *debugfs_create_file(const char *name, umode_t mode, struct dentry *parent, void *data, const struct file_operations *fops) That is, you pass a the name, the mode, the parent dentry, data, and the fops and that will create an inode and dentry (which is returned). This happens at boot up before user space is running and before debugfs is even mounted. Because debugfs is mostly for debugging, people don't care about how it's mounted. It is usually restricted to root only access. Especially since there's a lot of sensitive information that shouldn't be exposed to non-privileged users. The reason tracefs came about is that people asked me to be able to have access to tracing without needing to even enable debugfs. They also want to easily make it accessible to non root users and talking with Kees Cook, he recommended using ACL. But because it inherited a lot from debugfs, I started doing these tricks like walking the dentry tree to make it work a bit better. Because the dentries and inodes were created before mount, I had to play these tricks. But as Linus pointed out, that was the wrong way to do that. The right way was to use .getattr and .permission callbacks to figure out what the permissions to the files are. This has nothing to do with the creation of the files, it's about who has access to the files that the inodes point to. This sounds like another topic to bring up at LSFMM ;-) "Can we standardize pseudo file systems like debugfs and tracefs to act more like real file systems, and have inodes and dentries act as cache and not be so coupled to the data?" -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Mon, 8 Jan 2024 12:04:54 +0100 Christian Brauner wrote: > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > > redundant and just wrong. > > > > I don't think so. > > I'm very well aware that the dentries and inode aren't created during > mkdir but the completely directory layout is determined. You're just > splicing in dentries and inodes during lookup and readdir. > > If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later > do a lookup/readdir on > > ls -al /sys/kernel/tracing/instances/foo/events > > Why should the creation of the dentries and inodes ever fail due to a > permission failure? They shouldn't. > The vfs did already verify that you had the required > permissions to list entries in that directory. Why should filling up > /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't > That tracefs instance would be half-functional. And again, right now > that inode_permission() check cannot even fail. And it shouldn't. But without dentries and inodes, how does VFS know what is allowed to open the files? -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote: > On Sun, 7 Jan 2024 13:29:12 -0500 > Steven Rostedt wrote: > > > > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > > redundant and just wrong. > > > > I don't think so. > > Just to make it clear. eventfs has nothing to do with mkdir instance/foo. > It exists without that. Although one rationale to do eventfs was so Every instance/foo/ tracefs instances also contains an events directory and thus a eventfs portion. Eventfs is just a subtree of tracefs. It's not a separate filesystem. Both eventfs and tracefs are on the same single, system wide superblock. > that the instance directories wouldn't recreate the same 10thousands > event inodes and dentries for every mkdir done. I know but that's irrelevant to what I'm trying to tell you. A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs instance. With or without the on-demand dentry and inode creation for the eventfs portion that tracefs "instance" has now been created in its entirety including all the required information for someone to later come along and perform a lookup on /sys/kernel/tracing/instances/foo/events. All you've done is to defer the addition of the dentries and inodes when someone does actually look at the events directory of the tracefs instance. Whether you choose to splice in the dentries and inodes for the eventfs portion during lookup and readdir or if you had chosen to not do the on-demand thing at all and the entries were created at the same time as the mkdir call are equivalent from the perspective of permission checking. If you have the required permissions to look at the events directory then there's no reason why listing the directory entries in there should fail. This can't even happen right now.
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
> > * Tracefs supports the creation of instances from userspace via mkdir. > > For example, > > > > mkdir /sys/kernel/tracing/instances/foo > > > > And here the idmapping is relevant so we need to make the helpers > > aware of the idmapping. > > > > I just went and plumbed this through to most helpers. > > > > There's some subtlety in eventfs. Afaict, the directories and files for > > the individual events are created on-demand during lookup or readdir. > > > > The ownership of these events is again inherited from the parent inode > > or recovered from stored state. In both cases the actual idmapping is > > irrelevant. > > > > The callchain here is: > > > > eventfs_root_lookup("xfs", "events") > > -> create_{dir,file}_dentry("xfs", "events") > >-> create_{dir,file}("xfs", "events") > > -> eventfs_start_creating("xfs", "events") > > -> lookup_one_len("xfs", "events") > > > > And the subtlety is that lookup_one_len() does permission checking on > > the parent inode (IOW, if you want a dentry for "blech" under "events" > > it'll do a permission check on events->d_inode) for exec permissions > > and then goes on to give you a new dentry. > > > > Usually this call would have to be changed to lookup_one() and the > > idmapping be handed down to it. But I think that's irrelevant here. > > > > Lookup generally doesn't need to be aware of idmappings at all. The > > permission checking is done purely in the vfs via may_lookup() and the > > idmapping is irrelevant because we always initialize inodes with the > > filesystem level ownership (see the idmappings.rst) documentation if > > you're interested in excessive details (otherwise you get inode aliases > > which you really don't want). > > > > For tracefs it would not matter for lookup per se but only because > > tracefs seemingly creates inodes/dentries during lookup (and readdir()). > > tracefs creates the inodes/dentries at boot up, it's eventfs that does > it on demand during lookup. > > For inodes/dentries: > > /sys/kernel/tracing/* is all created at boot up, except for "events". Yes. > /sys/kernel/tracing/events/* is created on demand. Yes. > > > > > But imho the permission checking done in current eventfs_root_lookup() > > via lookup_one_len() is meaningless in any way; possibly even > > (conceptually) wrong. > > > > Because, the actual permission checking for the creation of the eventfs > > entries isn't really done during lookup or readdir, it's done when mkdir > > is called: > > > > mkdir /sys/kernel/tracing/instances/foo > > No. that creates a entire new tracefs instance, which happens to > include another eventfs directory. Yes, I'm aware of all that. > No. Only the meta data is created for the eventfs directory with a > mkdir instances/foo. The inodes and dentries are not there. I know, that is what I'm saying. > > > > > When one goes and looksup stuff under foo/events/ or readdir the entries > > in that directory: > > > > fd = open("foo/events") > > readdir(fd, ...) > > > > then they are licensed to list an entry in that directory. So all that > > needs to be done is to actually list those files in that directory. And > > since they already exist (they were created during mkdir) we just need > > to splice in inodes and dentries for them. But for that we shouldn't > > check permissions on the directory again. Because we've done that > > already correctly when the VFS called may_lookup(). > > No they do not exist. I am aware. > > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > redundant and just wrong. > > I don't think so. I'm very well aware that the dentries and inode aren't created during mkdir but the completely directory layout is determined. You're just splicing in dentries and inodes during lookup and readdir. If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later do a lookup/readdir on ls -al /sys/kernel/tracing/instances/foo/events Why should the creation of the dentries and inodes ever fail due to a permission failure? The vfs did already verify that you had the required permissions to list entries in that directory. Why should filling up /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't That tracefs instance would be half-functional. And again, right now that inode_permission() check cannot even fail.
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Sun, 7 Jan 2024 13:29:12 -0500 Steven Rostedt wrote: > > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is > > redundant and just wrong. > > I don't think so. Just to make it clear. eventfs has nothing to do with mkdir instance/foo. It exists without that. Although one rationale to do eventfs was so that the instance directories wouldn't recreate the same 10thousands event inodes and dentries for every mkdir done. -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Sun, 7 Jan 2024 13:42:39 +0100 Christian Brauner wrote: > > So, I tried to do an exploratory patch even though I promised myself not > to do it. But hey... > > Some notes: > > * Permission handling for idmapped mounts is done completely in the > VFS. That's the case for all filesytems that don't have a custom > ->permission() handler. So there's nothing to do for us here. > > * Idmapped mount handling for ->getattr() is done completely by the VFS > if the filesystem doesn't have a custom ->getattr() handler. So we're > done here. > > * Tracefs doesn't support attribute changes via ->setattr() (chown(), > chmod etc.). So there's nothing to here. > > * Eventfs does support attribute changes via ->setattr(). But it relies > on simple_setattr() which is already idmapped mount aware. So there's > nothing for us to do. > > * Ownership is inherited from the parent inode (tracefs) or optionally > from stashed ownership information (eventfs). That means the idmapping > is irrelevant here. It's similar to the "inherit gid from parent > directory" logic we have in some circumstances. TL;DR nothing to do > here as well. The reason ownership is inherited from the parent is because the inodes are created at boot up before user space starts. eventfs inodes are created on demand after user space so it needs to either check the "default" ownership and permissions, or if the user changed a specific file/directory, it must save it and use that again if the inode/dentry are reclaimed and then referenced again and recreated. > > * Tracefs supports the creation of instances from userspace via mkdir. > For example, > > mkdir /sys/kernel/tracing/instances/foo > > And here the idmapping is relevant so we need to make the helpers > aware of the idmapping. > > I just went and plumbed this through to most helpers. > > There's some subtlety in eventfs. Afaict, the directories and files for > the individual events are created on-demand during lookup or readdir. > > The ownership of these events is again inherited from the parent inode > or recovered from stored state. In both cases the actual idmapping is > irrelevant. > > The callchain here is: > > eventfs_root_lookup("xfs", "events") > -> create_{dir,file}_dentry("xfs", "events") >-> create_{dir,file}("xfs", "events") > -> eventfs_start_creating("xfs", "events") > -> lookup_one_len("xfs", "events") > > And the subtlety is that lookup_one_len() does permission checking on > the parent inode (IOW, if you want a dentry for "blech" under "events" > it'll do a permission check on events->d_inode) for exec permissions > and then goes on to give you a new dentry. > > Usually this call would have to be changed to lookup_one() and the > idmapping be handed down to it. But I think that's irrelevant here. > > Lookup generally doesn't need to be aware of idmappings at all. The > permission checking is done purely in the vfs via may_lookup() and the > idmapping is irrelevant because we always initialize inodes with the > filesystem level ownership (see the idmappings.rst) documentation if > you're interested in excessive details (otherwise you get inode aliases > which you really don't want). > > For tracefs it would not matter for lookup per se but only because > tracefs seemingly creates inodes/dentries during lookup (and readdir()). tracefs creates the inodes/dentries at boot up, it's eventfs that does it on demand during lookup. For inodes/dentries: /sys/kernel/tracing/* is all created at boot up, except for "events". /sys/kernel/tracing/events/* is created on demand. > > But imho the permission checking done in current eventfs_root_lookup() > via lookup_one_len() is meaningless in any way; possibly even > (conceptually) wrong. > > Because, the actual permission checking for the creation of the eventfs > entries isn't really done during lookup or readdir, it's done when mkdir > is called: > > mkdir /sys/kernel/tracing/instances/foo No. that creates a entire new tracefs instance, which happens to include another eventfs directory. The eventsfs directory is in "/sys/kernel/tracing/events" and "/sys/kernel/tracing/instances/*/events" eventfs has 10s of thousands of files and directories which is why I changed it to be on-demand inode/dentry creation and also reclaiming when no longer accessed. # ls /sys/kernel/tracing/events/ will create the inodes and dentries, and a memory stress program will free those created inodes and dentries. > > Here, all po
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Sun, Jan 07, 2024 at 06:42:33PM +0100, Christian Brauner wrote: > On Sun, Jan 07, 2024 at 01:42:39PM +0100, Christian Brauner wrote: > > > > So tracefs supports remounting with different uid/gid mount options and > > > > then actually wades through _all_ of the inodes and changes their > > > > ownership internally? What's the use-case for this? Containers? > > > > > > No, in fact tracing doesn't work well with containers as tracing is global > > > to the entire machine. It can work with privileged containers though. > > > > At least the tracefs interface is easily supportable within a delegation > > model. IOW, you have a privileged process that delegates relevant > > portions to a container via idmapped mounts _without_ doing the insane thing > > and making it mountable by a container aka the fs-to-CVE pipeline. > > > > > > > > The reason for this is because tracefs was based off of debugfs where the > > > files and directores are created at boot up and mounted later. The reason > > > to do this was to allow users to mount with gid=GID to allow a given group > > > to have access to tracing. Without this update, tracefs would ignore it > > > like debugfs and proc does today. > > > > > > I think its time I explain the purpose of tracefs and how it came to be. > > > > > > The tracing system required a way to control tracing and read the traces. > > > It could have just used a new system like perf (although > > > /sys/kernel/debug/tracing predates perf), where it created a single > > > ioctl() > > > like system call do do everything. > > > > > > As the ftrace tracing came from PREEMPT_RT latency tracer and my own > > > logdev > > > tracer, which both have an embedded background, I chose an interface that > > > could work with just an unmodified version of busybox. That is, I wanted > > > it > > > to work with just cat and echo. > > > > > > The main difference with tracefs compared to other file systems is that it > > > is a control interface, where writes happen as much as reads. The data > > > read > > > is controlled. The closest thing I can think of is how cgroups work. > > > > > > As tracing is a privileged operation, but something that could be changed > > > to allow a group to have access to, I wanted to make it easy for an admin > > > to decide who gets to do what at boot up via the /etc/fstab file. > > > > Yeah, ok. I think you could achieve the same thing via idmapped mounts. You > > just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount. > > > > mount(8) should just give you the ability to specify "map the ids I > > explicitly > > want to remap to something else and for the rest use the identity mapping". > > I > > wanted that for other reasons anyway. > > > > So in one of the next versions of mount(8) you can then do (where --beneath > > means place the mount beneath the current one and --replace is > > self-explanatory): > > > > sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' > > /sys/kernel/tracing > > sudo umount /sys/kernel/tracing > > > > or as a shortcut provided by mount(8): > > > > sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' > > /sys/kernel/tracing > > > > In both cases you replace the mount without unmounting tracefs. > > > > I can illustrate this right now though: > > > > user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 > > u:0:1000:1' /sys/kernel/tracing/ /mnt/ > > > > # This is a tool I wrote for testing the patchset I wrote back then. > > user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath > > --detached /mnt /sys/kernel/tracing > > Mounting beneath top mount > > Creating anonymous mount > > Attaching mount /mnt -> /sys/kernel/tracing > > Creating single detached mount > > > > user1@localhost:~/data/move-mount-beneath$ > > > > # Now there's two mounts stacked on top of each other. > > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing > > | `-/sys/kernel/tracingtracefstracefs > > rw,nosuid,nodev,noexec,relatime,idmapped > > | `-/sys/kernel/tracing tracefstracefs > > rw,nosuid,nodev,noexec,relatime > > > > user1@localhost:~/data/move-mount-beneath$ sudo ls -al > > /sys/kernel/tracing/| head > > total 0 > > drwx-- 6 root root 0 J
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Sun, Jan 07, 2024 at 01:42:39PM +0100, Christian Brauner wrote: > > > So tracefs supports remounting with different uid/gid mount options and > > > then actually wades through _all_ of the inodes and changes their > > > ownership internally? What's the use-case for this? Containers? > > > > No, in fact tracing doesn't work well with containers as tracing is global > > to the entire machine. It can work with privileged containers though. > > At least the tracefs interface is easily supportable within a delegation > model. IOW, you have a privileged process that delegates relevant > portions to a container via idmapped mounts _without_ doing the insane thing > and making it mountable by a container aka the fs-to-CVE pipeline. > > > > > The reason for this is because tracefs was based off of debugfs where the > > files and directores are created at boot up and mounted later. The reason > > to do this was to allow users to mount with gid=GID to allow a given group > > to have access to tracing. Without this update, tracefs would ignore it > > like debugfs and proc does today. > > > > I think its time I explain the purpose of tracefs and how it came to be. > > > > The tracing system required a way to control tracing and read the traces. > > It could have just used a new system like perf (although > > /sys/kernel/debug/tracing predates perf), where it created a single ioctl() > > like system call do do everything. > > > > As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev > > tracer, which both have an embedded background, I chose an interface that > > could work with just an unmodified version of busybox. That is, I wanted it > > to work with just cat and echo. > > > > The main difference with tracefs compared to other file systems is that it > > is a control interface, where writes happen as much as reads. The data read > > is controlled. The closest thing I can think of is how cgroups work. > > > > As tracing is a privileged operation, but something that could be changed > > to allow a group to have access to, I wanted to make it easy for an admin > > to decide who gets to do what at boot up via the /etc/fstab file. > > Yeah, ok. I think you could achieve the same thing via idmapped mounts. You > just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount. > > mount(8) should just give you the ability to specify "map the ids I explicitly > want to remap to something else and for the rest use the identity mapping". I > wanted that for other reasons anyway. > > So in one of the next versions of mount(8) you can then do (where --beneath > means place the mount beneath the current one and --replace is > self-explanatory): > > sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing > sudo umount /sys/kernel/tracing > > or as a shortcut provided by mount(8): > > sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' > /sys/kernel/tracing > > In both cases you replace the mount without unmounting tracefs. > > I can illustrate this right now though: > > user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' > /sys/kernel/tracing/ /mnt/ > > # This is a tool I wrote for testing the patchset I wrote back then. > user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath > --detached /mnt /sys/kernel/tracing > Mounting beneath top mount > Creating anonymous mount > Attaching mount /mnt -> /sys/kernel/tracing > Creating single detached mount > > user1@localhost:~/data/move-mount-beneath$ > > # Now there's two mounts stacked on top of each other. > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing > | `-/sys/kernel/tracingtracefstracefs > rw,nosuid,nodev,noexec,relatime,idmapped > | `-/sys/kernel/tracing tracefstracefs > rw,nosuid,nodev,noexec,relatime > > user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| > head > total 0 > drwx-- 6 root root 0 Jan 7 13:33 . > drwxr-xr-x 16 root root 0 Jan 7 13:33 .. > -r--r- 1 root root 0 Jan 7 13:33 README > -r--r- 1 root root 0 Jan 7 13:33 available_events > -r--r- 1 root root 0 Jan 7 13:33 available_filter_functions > -r--r- 1 root root 0 Jan 7 13:33 available_filter_functions_addrs > -r--r- 1 root root 0 Jan 7 13:33 available_tracers > -rw-r- 1 root root 0 Jan 7 13:33 buffer_percent > -rw-r- 1 root root 0 Jan 7 13:33 buffer_size_kb > > # Reveal updated mount > user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/k
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
> > So tracefs supports remounting with different uid/gid mount options and > > then actually wades through _all_ of the inodes and changes their > > ownership internally? What's the use-case for this? Containers? > > No, in fact tracing doesn't work well with containers as tracing is global > to the entire machine. It can work with privileged containers though. At least the tracefs interface is easily supportable within a delegation model. IOW, you have a privileged process that delegates relevant portions to a container via idmapped mounts _without_ doing the insane thing and making it mountable by a container aka the fs-to-CVE pipeline. > > The reason for this is because tracefs was based off of debugfs where the > files and directores are created at boot up and mounted later. The reason > to do this was to allow users to mount with gid=GID to allow a given group > to have access to tracing. Without this update, tracefs would ignore it > like debugfs and proc does today. > > I think its time I explain the purpose of tracefs and how it came to be. > > The tracing system required a way to control tracing and read the traces. > It could have just used a new system like perf (although > /sys/kernel/debug/tracing predates perf), where it created a single ioctl() > like system call do do everything. > > As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev > tracer, which both have an embedded background, I chose an interface that > could work with just an unmodified version of busybox. That is, I wanted it > to work with just cat and echo. > > The main difference with tracefs compared to other file systems is that it > is a control interface, where writes happen as much as reads. The data read > is controlled. The closest thing I can think of is how cgroups work. > > As tracing is a privileged operation, but something that could be changed > to allow a group to have access to, I wanted to make it easy for an admin > to decide who gets to do what at boot up via the /etc/fstab file. Yeah, ok. I think you could achieve the same thing via idmapped mounts. You just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount. mount(8) should just give you the ability to specify "map the ids I explicitly want to remap to something else and for the rest use the identity mapping". I wanted that for other reasons anyway. So in one of the next versions of mount(8) you can then do (where --beneath means place the mount beneath the current one and --replace is self-explanatory): sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing sudo umount /sys/kernel/tracing or as a shortcut provided by mount(8): sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing In both cases you replace the mount without unmounting tracefs. I can illustrate this right now though: user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' /sys/kernel/tracing/ /mnt/ # This is a tool I wrote for testing the patchset I wrote back then. user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath --detached /mnt /sys/kernel/tracing Mounting beneath top mount Creating anonymous mount Attaching mount /mnt -> /sys/kernel/tracing Creating single detached mount user1@localhost:~/data/move-mount-beneath$ # Now there's two mounts stacked on top of each other. user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing | `-/sys/kernel/tracingtracefstracefs rw,nosuid,nodev,noexec,relatime,idmapped | `-/sys/kernel/tracing tracefstracefs rw,nosuid,nodev,noexec,relatime user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head total 0 drwx-- 6 root root 0 Jan 7 13:33 . drwxr-xr-x 16 root root 0 Jan 7 13:33 .. -r--r- 1 root root 0 Jan 7 13:33 README -r--r- 1 root root 0 Jan 7 13:33 available_events -r--r- 1 root root 0 Jan 7 13:33 available_filter_functions -r--r- 1 root root 0 Jan 7 13:33 available_filter_functions_addrs -r--r- 1 root root 0 Jan 7 13:33 available_tracers -rw-r- 1 root root 0 Jan 7 13:33 buffer_percent -rw-r- 1 root root 0 Jan 7 13:33 buffer_size_kb # Reveal updated mount user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing | `-/sys/kernel/tracingtracefstracefs rw,nosuid,nodev,noexec,relatime,idmapped user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| head total 0 drwx-- 6 user1 user1 0 Jan 7 13:33 . drwxr-xr-x 16 root root 0 Jan 7 13:33 .. -r--r- 1 user1 user1 0 Jan 7 13:33 README -r--r- 1 user1 user1 0 Jan 7 13:33 available_events -r--r- 1 user1 user1 0 Jan 7 13:33 available_filter_functions -r--r- 1 user
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Fri, 5 Jan 2024 15:26:28 +0100 Christian Brauner wrote: > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > From: "Steven Rostedt (Google)" > > > > Instead of walking the dentries on mount/remount to update the gid values of > > all the dentries if a gid option is specified on mount, just update the root > > inode. Add .getattr, .setattr, and .permissions on the tracefs inode > > operations to update the permissions of the files and directories. > > > > For all files and directories in the top level instance: > > > > /sys/kernel/tracing/* > > > > It will use the root inode as the default permissions. The inode that > > represents: /sys/kernel/tracing (or wherever it is mounted). > > > > When an instance is created: > > > > mkdir /sys/kernel/tracing/instance/foo > > > > The directory "foo" and all its files and directories underneath will use > > the default of what foo is when it was created. A remount of tracefs will > > not affect it. > > That kinda sounds like eventfs should actually be a separate filesystem. > But I don't know enough about the relationship between the two concepts. It may someday become one, as eventfs is used by perf where the rest of the tracefs system is not. > > > > > If a user were to modify the permissions of any file or directory in > > tracefs, it will also no longer be modified by a change in ownership of a > > remount. > > Very odd semantics and I would recommend to avoid that. It's just plain > weird imo. > > > > > The events directory, if it is in the top level instance, will use the > > tracefs root inode as the default ownership for itself and all the files and > > directories below it. > > > > For the events directory in an instance ("foo"), it will keep the ownership > > of what it was when it was created, and that will be used as the default > > ownership for the files and directories beneath it. > > > > Link: > > https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/ > > > > Signed-off-by: Steven Rostedt (Google) > > --- > > So tracefs supports remounting with different uid/gid mount options and > then actually wades through _all_ of the inodes and changes their > ownership internally? What's the use-case for this? Containers? No, in fact tracing doesn't work well with containers as tracing is global to the entire machine. It can work with privileged containers though. The reason for this is because tracefs was based off of debugfs where the files and directores are created at boot up and mounted later. The reason to do this was to allow users to mount with gid=GID to allow a given group to have access to tracing. Without this update, tracefs would ignore it like debugfs and proc does today. I think its time I explain the purpose of tracefs and how it came to be. The tracing system required a way to control tracing and read the traces. It could have just used a new system like perf (although /sys/kernel/debug/tracing predates perf), where it created a single ioctl() like system call do do everything. As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev tracer, which both have an embedded background, I chose an interface that could work with just an unmodified version of busybox. That is, I wanted it to work with just cat and echo. The main difference with tracefs compared to other file systems is that it is a control interface, where writes happen as much as reads. The data read is controlled. The closest thing I can think of is how cgroups work. As tracing is a privileged operation, but something that could be changed to allow a group to have access to, I wanted to make it easy for an admin to decide who gets to do what at boot up via the /etc/fstab file. > > Aside from optimizing this and the special semantics for this eventfs > stuff that you really should think twice of doing, here's one idea for > an extension that might alleviate some of the pain: > > If you need flexible dynamic ownership change to e.g., be able to > delegate (all, a directory, a single file of) tracefs to > unprivileged/containers/whatever then you might want to consider > supporting idmapped mounts for tracefs. Because then you can do stuff > like: I'm a novice here and have no idea on how id maps work ;-) > > user1@localhost:~/data/scripts$ sudo mount --bind -o > X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt > user1@localhost:~/data/scripts$ ls -ln /run/ > total 12 > drwxr-xr-x 2 0 0 40 Jan 5 12:12 credentials > drwx-- 2 0 0 40 Jan 5 11:57 cryptsetup > drwxr-xr-x 2 0 0 60 Jan
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > Instead of walking the dentries on mount/remount to update the gid values of > all the dentries if a gid option is specified on mount, just update the root > inode. Add .getattr, .setattr, and .permissions on the tracefs inode > operations to update the permissions of the files and directories. > > For all files and directories in the top level instance: > > /sys/kernel/tracing/* > > It will use the root inode as the default permissions. The inode that > represents: /sys/kernel/tracing (or wherever it is mounted). > > When an instance is created: > > mkdir /sys/kernel/tracing/instance/foo > > The directory "foo" and all its files and directories underneath will use > the default of what foo is when it was created. A remount of tracefs will > not affect it. That kinda sounds like eventfs should actually be a separate filesystem. But I don't know enough about the relationship between the two concepts. > > If a user were to modify the permissions of any file or directory in > tracefs, it will also no longer be modified by a change in ownership of a > remount. Very odd semantics and I would recommend to avoid that. It's just plain weird imo. > > The events directory, if it is in the top level instance, will use the > tracefs root inode as the default ownership for itself and all the files and > directories below it. > > For the events directory in an instance ("foo"), it will keep the ownership > of what it was when it was created, and that will be used as the default > ownership for the files and directories beneath it. > > Link: > https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/ > > Signed-off-by: Steven Rostedt (Google) > --- So tracefs supports remounting with different uid/gid mount options and then actually wades through _all_ of the inodes and changes their ownership internally? What's the use-case for this? Containers? Aside from optimizing this and the special semantics for this eventfs stuff that you really should think twice of doing, here's one idea for an extension that might alleviate some of the pain: If you need flexible dynamic ownership change to e.g., be able to delegate (all, a directory, a single file of) tracefs to unprivileged/containers/whatever then you might want to consider supporting idmapped mounts for tracefs. Because then you can do stuff like: user1@localhost:~/data/scripts$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt user1@localhost:~/data/scripts$ ls -ln /run/ total 12 drwxr-xr-x 2 0 0 40 Jan 5 12:12 credentials drwx-- 2 0 0 40 Jan 5 11:57 cryptsetup drwxr-xr-x 2 0 0 60 Jan 5 11:57 dbus drwx-- 6 0 0 280 Jan 5 11:57 incus_agent prw--- 1 0 00 Jan 5 11:57 initctl drwxrwxrwt 4 0 0 80 Jan 5 11:57 lock drwxr-xr-x 3 0 0 60 Jan 5 11:57 log drwx-- 2 0 0 40 Jan 5 11:57 lvm -r--r--r-- 1 0 0 33 Jan 5 11:57 machine-id -rw-r--r-- 1 0 0 101 Jan 5 11:58 motd.dynamic drwxr-xr-x 2 0 0 40 Jan 5 11:57 mount drwx-- 2 0 0 40 Jan 5 11:57 multipath drwxr-xr-x 2 0 0 40 Jan 5 11:57 sendsigs.omit.d lrwxrwxrwx 1 0 08 Jan 5 11:57 shm -> /dev/shm drwx--x--x 2 0 0 40 Jan 5 11:57 sudo drwxr-xr-x 24 0 0 660 Jan 5 14:30 systemd drwxr-xr-x 6 0 0 140 Jan 5 14:30 udev drwxr-xr-x 4 0 0 80 Jan 5 11:58 user -rw-rw-r-- 1 0 43 2304 Jan 5 15:15 utmp user1@localhost:~/data/scripts$ ls -ln /mnt/ total 12 drwxr-xr-x 2 1234 1000 40 Jan 5 12:12 credentials drwx-- 2 1234 1000 40 Jan 5 11:57 cryptsetup drwxr-xr-x 2 1234 1000 60 Jan 5 11:57 dbus drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 incus_agent prw--- 1 1234 10000 Jan 5 11:57 initctl drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 lock drwxr-xr-x 3 1234 1000 60 Jan 5 11:57 log drwx-- 2 1234 1000 40 Jan 5 11:57 lvm -r--r--r-- 1 1234 1000 33 Jan 5 11:57 machine-id -rw-r--r-- 1 1234 1000 101 Jan 5 11:58 motd.dynamic drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 mount drwx-- 2 1234 1000 40 Jan 5 11:57 multipath drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 sendsigs.omit.d lrwxrwxrwx 1 1234 10008 Jan 5 11:57 shm -> /dev/shm drwx--x--x 2 1234 1000 40 Jan 5 11:57 sudo drwxr-xr-x 24 1234 1000 660 Jan 5 14:30 systemd drwxr-xr-x 6 1234 1000 140 Jan 5 14:30 udev drwxr-xr-x 4 1234 1000 80 Jan 5 11:58 user -rw-rw-r-- 1 1234 65534 2304 Jan 5 15:15 utmp Where you can see that ownership of this tmpfs instance in this example is changed. I'm not trying to advocate here but this will probably ultimately be nicer for your users because it means that a container manager or whatever can be handed a part of tracefs (or all of it) and the own
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, Jan 03, 2024 at 09:25:06PM -0500, Steven Rostedt wrote: > On Thu, 4 Jan 2024 01:48:37 + > Al Viro wrote: > > > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > > > > + /* Get the tracefs root from the parent */ > > > + inode = d_inode(dentry->d_parent); > > > + inode = d_inode(inode->i_sb->s_root); > > > > That makes no sense. First of all, for any positive dentry we have > > dentry->d_sb == dentry->d_inode->i_sb. And it's the same for all > > dentries on given superblock. So what's the point of that dance? > > If you want the root inode, just go for d_inode(dentry->d_sb->s_root) > > and be done with that... > > That was more of thinking that the dentry and dentry->d_parent are > different. As dentry is part of eventfs and dentry->d_parent is part of > tracefs. ??? >Currently they both have the same superblock so yeah, I could just > write it that way too and it would work. But in my head, I was thinking > that they behave differently and maybe one day eventfs would get its own > superblock which would not work. ->d_parent *always* points to the same filesystem; if you get an (automounted?) mountpoint there, ->d_parent simply won't work - it will point to dentry itself. > To explain this better: > > /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events > > But everything but "events" in /sys/kernel/tracing/* is part of tracefs. > Everything in /sys/kernel/tracing/events is part of eventfs. > > That was my thought process. But as both tracefs and eventfs still use > tracefs_get_inode(), it would work as you state. > > I'll update that, as I don't foresee that eventfs will become its own file > system. There is no way to get to underlying mountpoint by dentry - simply because the same fs instance can be mounted in any number of places. A crude overview of taxonomy: file_system_type: what filesystem instances belong to. Not quite the same thing as fs driver (one driver can provide several of those). Usually it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...). super_block: individual filesystem instance. Hosts dentry tree (connected or several disconnected parts - think NFSv4 or the state while trying to get a dentry by fhandle, etc.). dentry: object in a filesystem's directory tree(s). Always belongs to specific filesystem instance - that relationship never changes. Tree structure (and names) _within_ _filesystem_ belong on that level. ->d_parent is part of that tree structure; never NULL, root of a (sub)tree has it pointing to itself. Might be negative, might refer to a filesystem object (file, directory, symlink, etc.). inode: filesystem object (file, directory, etc.). Always belongs to specific filesystem instance. Non-directory inodes might have any number of dentry instances (aliases) refering to it; a directory one - no more than one. Filesystem object contents belongs here; multiple hardlinks have different dentries and the same inode. Of course, filesystem type in question might have no such thing as multiple hardlinks - that's up to filesystem. In general there is no way to find (or enumerate) such links; e.g. a local filesystem might have an extra hardlink somewhere we had never looked at and there won't be any dentries for such hardlinks and no way to get them short of doing the full search of the entire tree. The situation with e.g. NFS client is even worse, obviously. mount: in a sense, mount to super_block is what dentry is to inode. It provides a view of (sub)tree hosted in given filesystem instance. The same filesystem may have any number of mounts, refering to its subtrees (possibly the same subtree for each, possibly all different - up to the callers of mount(2)). They form mount tree(s) - that's where the notions related to "this mounted on top of that" belong. Note that they can be moved around - with no telling the filesystem about that happening. Again, there's no such thing as "the mountpoint of given filesystem instance" - it might be mounted in any number of places at the same time. Specific mount - sure, no problem, except that it can move around. namespace: mount tree. Unlike everything prior, this one is a part of process state - same as descriptor table, mappings, etc. file: opened IO channel. It does refer to specific mount and specific dentry (and thus filesystem instance and an inode on it). Current IO position lives here, so does any per-open(2) state. descriptor table: mapping from numbers to IO channels (opened files). Again, a part of process state. dup() creates a new entry, with reference to the same file as the old one; multiple open() of the same pathname will each yield a separate opened file. _Some_ state belongs here (close-on-exec, m
[PATCH v2] tracefs/eventfs: Use root and instance inodes as default ownership
try *parent, void *data, const struct file_operations *fops) { + struct tracefs_inode *ti; struct dentry *dentry; struct inode *inode; @@ -621,7 +635,11 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, if (unlikely(!inode)) return tracefs_failed_creating(dentry); + ti = get_tracefs(inode); + ti->private = instance_inode(parent, inode); + inode->i_mode = mode; + inode->i_op = _file_inode_operations; inode->i_fop = fops ? fops : _file_operations; inode->i_private = data; inode->i_uid = d_inode(dentry->d_parent)->i_uid; @@ -634,6 +652,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, static struct dentry *__create_dir(const char *name, struct dentry *parent, const struct inode_operations *ops) { + struct tracefs_inode *ti; struct dentry *dentry = tracefs_start_creating(name, parent); struct inode *inode; @@ -651,6 +670,9 @@ static struct dentry *__create_dir(const char *name, struct dentry *parent, inode->i_uid = d_inode(dentry->d_parent)->i_uid; inode->i_gid = d_inode(dentry->d_parent)->i_gid; + ti = get_tracefs(inode); + ti->private = instance_inode(parent, inode); + /* directory inodes start off with i_nlink == 2 (for "." entry) */ inc_nlink(inode); d_instantiate(dentry, inode); @@ -681,7 +703,7 @@ struct dentry *tracefs_create_dir(const char *name, struct dentry *parent) if (security_locked_down(LOCKDOWN_TRACEFS)) return NULL; - return __create_dir(name, parent, _dir_inode_operations); + return __create_dir(name, parent, _dir_inode_operations); } /** @@ -712,7 +734,7 @@ __init struct dentry *tracefs_create_instance_dir(const char *name, if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir)) return NULL; - dentry = __create_dir(name, parent, _dir_inode_operations); + dentry = __create_dir(name, parent, _instance_dir_inode_operations); if (!dentry) return NULL; diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 42bdeb471a07..12b7d0150ae9 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -5,6 +5,9 @@ enum { TRACEFS_EVENT_INODE = BIT(1), TRACEFS_EVENT_TOP_INODE = BIT(2), + TRACEFS_GID_PERM_SET= BIT(3), + TRACEFS_UID_PERM_SET= BIT(4), + TRACEFS_INSTANCE_INODE = BIT(5), }; struct tracefs_inode { -- 2.42.0
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Thu, 4 Jan 2024 01:48:37 + Al Viro wrote: > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > > + /* Get the tracefs root from the parent */ > > + inode = d_inode(dentry->d_parent); > > + inode = d_inode(inode->i_sb->s_root); > > That makes no sense. First of all, for any positive dentry we have > dentry->d_sb == dentry->d_inode->i_sb. And it's the same for all > dentries on given superblock. So what's the point of that dance? > If you want the root inode, just go for d_inode(dentry->d_sb->s_root) > and be done with that... That was more of thinking that the dentry and dentry->d_parent are different. As dentry is part of eventfs and dentry->d_parent is part of tracefs. Currently they both have the same superblock so yeah, I could just write it that way too and it would work. But in my head, I was thinking that they behave differently and maybe one day eventfs would get its own superblock which would not work. To explain this better: /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events But everything but "events" in /sys/kernel/tracing/* is part of tracefs. Everything in /sys/kernel/tracing/events is part of eventfs. That was my thought process. But as both tracefs and eventfs still use tracefs_get_inode(), it would work as you state. I'll update that, as I don't foresee that eventfs will become its own file system. Thanks, -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Thu, 4 Jan 2024 01:59:10 + Al Viro wrote: > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > > +static struct inode *instance_inode(struct dentry *parent, struct inode > > *inode) > > +{ > > + struct tracefs_inode *ti; > > + struct inode *root_inode; > > + > > + root_inode = d_inode(inode->i_sb->s_root); > > + > > + /* If parent is NULL then use root inode */ > > + if (!parent) > > + return root_inode; > > + > > + /* Find the inode that is flagged as an instance or the root inode */ > > + do { > > + inode = d_inode(parent); > > + if (inode == root_inode) > > + return root_inode; > > + > > + ti = get_tracefs(inode); > > + > > + if (ti->flags & TRACEFS_INSTANCE_INODE) > > + return inode; > > + } while ((parent = parent->d_parent)); > > *blink* > > This is equivalent to > ... > parent = parent->d_parent; > } while (true); Yeah, that loop went through a few iterations as I first thought that root was a tracefs_inode and the get_tracefs() would work on it. No, it was not, and it caused a cash. But I didn't rewrite the loop well after fixing that. I was also not sure if parent could be NULL, and wanted to protect against it. > > ->d_parent is *never* NULL. And what the hell is that loop supposed to do, > anyway? Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE? > > If root is not marked that way, I would suggest > if (!parent) > parent = inode->i_sb->s_root; > while (!IS_ROOT(parent)) { > struct tracefs_inode *ti = get_tracefs(parent->d_inode); > if (ti->flags & TRACEFS_INSTANCE_INODE) > break; > parent = parent->d_parent; > } > return parent->d_inode; Sure, I could rewrite it that way too. Thanks, -- Steve
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > +static struct inode *instance_inode(struct dentry *parent, struct inode > *inode) > +{ > + struct tracefs_inode *ti; > + struct inode *root_inode; > + > + root_inode = d_inode(inode->i_sb->s_root); > + > + /* If parent is NULL then use root inode */ > + if (!parent) > + return root_inode; > + > + /* Find the inode that is flagged as an instance or the root inode */ > + do { > + inode = d_inode(parent); > + if (inode == root_inode) > + return root_inode; > + > + ti = get_tracefs(inode); > + > + if (ti->flags & TRACEFS_INSTANCE_INODE) > + return inode; > + } while ((parent = parent->d_parent)); *blink* This is equivalent to ... parent = parent->d_parent; } while (true); ->d_parent is *never* NULL. And what the hell is that loop supposed to do, anyway? Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE? If root is not marked that way, I would suggest if (!parent) parent = inode->i_sb->s_root; while (!IS_ROOT(parent)) { struct tracefs_inode *ti = get_tracefs(parent->d_inode); if (ti->flags & TRACEFS_INSTANCE_INODE) break; parent = parent->d_parent; } return parent->d_inode;
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > + /* Get the tracefs root from the parent */ > + inode = d_inode(dentry->d_parent); > + inode = d_inode(inode->i_sb->s_root); That makes no sense. First of all, for any positive dentry we have dentry->d_sb == dentry->d_inode->i_sb. And it's the same for all dentries on given superblock. So what's the point of that dance? If you want the root inode, just go for d_inode(dentry->d_sb->s_root) and be done with that...
[PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
const struct file_operations *fops) { + struct tracefs_inode *ti; struct dentry *dentry; struct inode *inode; @@ -621,7 +642,11 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, if (unlikely(!inode)) return tracefs_failed_creating(dentry); + ti = get_tracefs(inode); + ti->private = instance_inode(parent, inode); + inode->i_mode = mode; + inode->i_op = _file_inode_operations; inode->i_fop = fops ? fops : _file_operations; inode->i_private = data; inode->i_uid = d_inode(dentry->d_parent)->i_uid; @@ -634,6 +659,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, static struct dentry *__create_dir(const char *name, struct dentry *parent, const struct inode_operations *ops) { + struct tracefs_inode *ti; struct dentry *dentry = tracefs_start_creating(name, parent); struct inode *inode; @@ -651,6 +677,9 @@ static struct dentry *__create_dir(const char *name, struct dentry *parent, inode->i_uid = d_inode(dentry->d_parent)->i_uid; inode->i_gid = d_inode(dentry->d_parent)->i_gid; + ti = get_tracefs(inode); + ti->private = instance_inode(parent, inode); + /* directory inodes start off with i_nlink == 2 (for "." entry) */ inc_nlink(inode); d_instantiate(dentry, inode); @@ -681,7 +710,7 @@ struct dentry *tracefs_create_dir(const char *name, struct dentry *parent) if (security_locked_down(LOCKDOWN_TRACEFS)) return NULL; - return __create_dir(name, parent, _dir_inode_operations); + return __create_dir(name, parent, _dir_inode_operations); } /** @@ -712,7 +741,7 @@ __init struct dentry *tracefs_create_instance_dir(const char *name, if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir)) return NULL; - dentry = __create_dir(name, parent, _dir_inode_operations); + dentry = __create_dir(name, parent, _instance_dir_inode_operations); if (!dentry) return NULL; diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 42bdeb471a07..12b7d0150ae9 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -5,6 +5,9 @@ enum { TRACEFS_EVENT_INODE = BIT(1), TRACEFS_EVENT_TOP_INODE = BIT(2), + TRACEFS_GID_PERM_SET= BIT(3), + TRACEFS_UID_PERM_SET= BIT(4), + TRACEFS_INSTANCE_INODE = BIT(5), }; struct tracefs_inode { -- 2.42.0
Re: [PATCH] dax: make sure inodes are flushed before destroy cache
On Mon, Feb 14, 2022 at 12:09:54PM -0800, Dan Williams wrote: > On Mon, Feb 14, 2022 at 9:59 AM Ira Weiny wrote: > > > > On Fri, Feb 11, 2022 at 11:11:11PM -0800, Tong Zhang wrote: > > > A bug can be triggered by following command > > > > > > $ modprobe nd_pmem && modprobe -r nd_pmem > > > > > > [ 10.060014] BUG dax_cache (Not tainted): Objects remaining in > > > dax_cache on __kmem_cache_shutdown() > > > [ 10.060938] Slab 0x85b729ac objects=9 used=1 > > > fp=0x4f5ae469 flags=0x2010200(slab|head|node) > > > [ 10.062433] Call Trace: > > > [ 10.062673] dump_stack_lvl+0x34/0x44 > > > [ 10.062865] slab_err+0x90/0xd0 > > > [ 10.063619] __kmem_cache_shutdown+0x13b/0x2f0 > > > [ 10.063848] kmem_cache_destroy+0x4a/0x110 > > > [ 10.064058] __x64_sys_delete_module+0x265/0x300 > > > > > > This is caused by dax_fs_exit() not flushing inodes before destroy cache. > > > To fix this issue, call rcu_barrier() before destroy cache. > > > > I don't doubt that this fixes the bug. However, I can't help but think > > this is > > hiding a bug, or perhaps a missing step, in the kmem_cache layer? As far > > as I > > can see dax does not call call_rcu() and only uses srcu not rcu? I was > > tempted > > to suggest srcu_barrier() but dax does not call call_srcu() either. > > This rcu_barrier() is associated with the call_rcu() in destroy_inode(). Ok yea. > > While kern_unmount() does a full sycnrhonize_rcu() after clearing > ->mnt_ns. Any pending destroy_inode() callbacks need to be flushed > before the kmem_cache is destroyed. > > > So I'm not clear about what is really going on and why this fixes it. I > > know > > that dax is not using srcu is a standard way so perhaps this helps in a way > > I > > don't quite grok? If so perhaps a comment here would be in order? > > Looks like a common pattern I missed that all filesystem exit paths implement. I think a comment would be in order, especially since since it looks like every other FS has one: fs/ext4/super.c: ... /* * Make sure all delayed rcu free inodes are flushed before we * destroy cache. */ rcu_barrier(); ... Anyway ok. Reviewed-by: Ira Weiny Thanks for looking Dan, Ira
Re: [PATCH] dax: make sure inodes are flushed before destroy cache
On Mon, Feb 14, 2022 at 9:59 AM Ira Weiny wrote: > > On Fri, Feb 11, 2022 at 11:11:11PM -0800, Tong Zhang wrote: > > A bug can be triggered by following command > > > > $ modprobe nd_pmem && modprobe -r nd_pmem > > > > [ 10.060014] BUG dax_cache (Not tainted): Objects remaining in dax_cache > > on __kmem_cache_shutdown() > > [ 10.060938] Slab 0x85b729ac objects=9 used=1 > > fp=0x4f5ae469 flags=0x2010200(slab|head|node) > > [ 10.062433] Call Trace: > > [ 10.062673] dump_stack_lvl+0x34/0x44 > > [ 10.062865] slab_err+0x90/0xd0 > > [ 10.063619] __kmem_cache_shutdown+0x13b/0x2f0 > > [ 10.063848] kmem_cache_destroy+0x4a/0x110 > > [ 10.064058] __x64_sys_delete_module+0x265/0x300 > > > > This is caused by dax_fs_exit() not flushing inodes before destroy cache. > > To fix this issue, call rcu_barrier() before destroy cache. > > I don't doubt that this fixes the bug. However, I can't help but think this > is > hiding a bug, or perhaps a missing step, in the kmem_cache layer? As far as I > can see dax does not call call_rcu() and only uses srcu not rcu? I was > tempted > to suggest srcu_barrier() but dax does not call call_srcu() either. This rcu_barrier() is associated with the call_rcu() in destroy_inode(). While kern_unmount() does a full sycnrhonize_rcu() after clearing ->mnt_ns. Any pending destroy_inode() callbacks need to be flushed before the kmem_cache is destroyed. > So I'm not clear about what is really going on and why this fixes it. I know > that dax is not using srcu is a standard way so perhaps this helps in a way I > don't quite grok? If so perhaps a comment here would be in order? Looks like a common pattern I missed that all filesystem exit paths implement.
Re: [PATCH] dax: make sure inodes are flushed before destroy cache
On Fri, Feb 11, 2022 at 11:11:11PM -0800, Tong Zhang wrote: > A bug can be triggered by following command > > $ modprobe nd_pmem && modprobe -r nd_pmem > > [ 10.060014] BUG dax_cache (Not tainted): Objects remaining in dax_cache on > __kmem_cache_shutdown() > [ 10.060938] Slab 0x85b729ac objects=9 used=1 fp=0x4f5ae469 > flags=0x2010200(slab|head|node) > [ 10.062433] Call Trace: > [ 10.062673] dump_stack_lvl+0x34/0x44 > [ 10.062865] slab_err+0x90/0xd0 > [ 10.063619] __kmem_cache_shutdown+0x13b/0x2f0 > [ 10.063848] kmem_cache_destroy+0x4a/0x110 > [ 10.064058] __x64_sys_delete_module+0x265/0x300 > > This is caused by dax_fs_exit() not flushing inodes before destroy cache. > To fix this issue, call rcu_barrier() before destroy cache. I don't doubt that this fixes the bug. However, I can't help but think this is hiding a bug, or perhaps a missing step, in the kmem_cache layer? As far as I can see dax does not call call_rcu() and only uses srcu not rcu? I was tempted to suggest srcu_barrier() but dax does not call call_srcu() either. So I'm not clear about what is really going on and why this fixes it. I know that dax is not using srcu is a standard way so perhaps this helps in a way I don't quite grok? If so perhaps a comment here would be in order? Ira > > Signed-off-by: Tong Zhang > --- > drivers/dax/super.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > index e3029389d809..6bd565fe2e63 100644 > --- a/drivers/dax/super.c > +++ b/drivers/dax/super.c > @@ -476,6 +476,7 @@ static int dax_fs_init(void) > static void dax_fs_exit(void) > { > kern_unmount(dax_mnt); > + rcu_barrier(); > kmem_cache_destroy(dax_cache); > } > > -- > 2.25.1 > >
Re: [PATCH] dax: make sure inodes are flushed before destroy cache
Looks good, Reviewed-by: Christoph Hellwig
[PATCH] dax: make sure inodes are flushed before destroy cache
A bug can be triggered by following command $ modprobe nd_pmem && modprobe -r nd_pmem [ 10.060014] BUG dax_cache (Not tainted): Objects remaining in dax_cache on __kmem_cache_shutdown() [ 10.060938] Slab 0x85b729ac objects=9 used=1 fp=0x4f5ae469 flags=0x2010200(slab|head|node) [ 10.062433] Call Trace: [ 10.062673] dump_stack_lvl+0x34/0x44 [ 10.062865] slab_err+0x90/0xd0 [ 10.063619] __kmem_cache_shutdown+0x13b/0x2f0 [ 10.063848] kmem_cache_destroy+0x4a/0x110 [ 10.064058] __x64_sys_delete_module+0x265/0x300 This is caused by dax_fs_exit() not flushing inodes before destroy cache. To fix this issue, call rcu_barrier() before destroy cache. Signed-off-by: Tong Zhang --- drivers/dax/super.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index e3029389d809..6bd565fe2e63 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -476,6 +476,7 @@ static int dax_fs_init(void) static void dax_fs_exit(void) { kern_unmount(dax_mnt); + rcu_barrier(); kmem_cache_destroy(dax_cache); } -- 2.25.1
Re: [PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t
On Fri, Feb 05, 2021 at 01:23:13PM -0800, Andrew Morton wrote: > On Fri, 5 Feb 2021 14:55:43 -0600 Seth Forshee > wrote: > > > On Fri, Feb 05, 2021 at 12:41:57PM -0800, Andrew Morton wrote: > > > On Fri, 5 Feb 2021 14:21:59 -0600 Seth Forshee > > > wrote: > > > > > > > Currently there seems to be an assumption in tmpfs that 64-bit > > > > architectures also have a 64-bit ino_t. This is not true; s390 at > > > > least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs > > > > mounts will get 64-bit inode numbers and display "inode64" in the > > > > mount options, but passing the "inode64" mount option will fail. > > > > This leads to the following behavior: > > > > > > > > # mkdir mnt > > > > # mount -t tmpfs nodev mnt > > > > # mount -o remount,rw mnt > > > > mount: /home/ubuntu/mnt: mount point not mounted or bad option. > > > > > > > > As mount sees "inode64" in the mount options and thus passes it > > > > in the options for the remount. > > > > > > > > Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8, > > > > but I don't think it's possible to test for this (potentially > > > > CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm > > > > not sure whether or not that is wanted). So fix this by simply > > > > refusing to honor the CONFIG_TMPFS_INODE64 setting when > > > > sizeof(ino_t) < 8. > > > > > > How about changing s390 Kconfig so that CONFIG_TMPFS_INODE64 is not > > > enabled? > > > > I did do that for our config. I see the s390 defconfig has it enabled, > > so I will send a patch for that too. But the fact that it can be > > configured that way and that the code behaves badly still seems > > problematic. > > I meant > > --- a/fs/Kconfig~a > +++ a/fs/Kconfig > @@ -203,7 +203,7 @@ config TMPFS_XATTR > > config TMPFS_INODE64 > bool "Use 64-bit ino_t by default in tmpfs" > - depends on TMPFS && 64BIT > + depends on TMPFS && 64BIT && !S390 > default n > help > tmpfs has historically used only inode numbers as wide as an unsigned > _ Ah. s390 does appear to be the only architecture that doesn't use unsigned long for ino_t, so that seems good to me. I can send a patch if you want, but seems like you've already got one ... Seth
Re: [PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t
On Fri, 5 Feb 2021 14:55:43 -0600 Seth Forshee wrote: > On Fri, Feb 05, 2021 at 12:41:57PM -0800, Andrew Morton wrote: > > On Fri, 5 Feb 2021 14:21:59 -0600 Seth Forshee > > wrote: > > > > > Currently there seems to be an assumption in tmpfs that 64-bit > > > architectures also have a 64-bit ino_t. This is not true; s390 at > > > least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs > > > mounts will get 64-bit inode numbers and display "inode64" in the > > > mount options, but passing the "inode64" mount option will fail. > > > This leads to the following behavior: > > > > > > # mkdir mnt > > > # mount -t tmpfs nodev mnt > > > # mount -o remount,rw mnt > > > mount: /home/ubuntu/mnt: mount point not mounted or bad option. > > > > > > As mount sees "inode64" in the mount options and thus passes it > > > in the options for the remount. > > > > > > Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8, > > > but I don't think it's possible to test for this (potentially > > > CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm > > > not sure whether or not that is wanted). So fix this by simply > > > refusing to honor the CONFIG_TMPFS_INODE64 setting when > > > sizeof(ino_t) < 8. > > > > How about changing s390 Kconfig so that CONFIG_TMPFS_INODE64 is not enabled? > > I did do that for our config. I see the s390 defconfig has it enabled, > so I will send a patch for that too. But the fact that it can be > configured that way and that the code behaves badly still seems > problematic. I meant --- a/fs/Kconfig~a +++ a/fs/Kconfig @@ -203,7 +203,7 @@ config TMPFS_XATTR config TMPFS_INODE64 bool "Use 64-bit ino_t by default in tmpfs" - depends on TMPFS && 64BIT + depends on TMPFS && 64BIT && !S390 default n help tmpfs has historically used only inode numbers as wide as an unsigned _
Re: [PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t
On Fri, Feb 05, 2021 at 12:41:57PM -0800, Andrew Morton wrote: > On Fri, 5 Feb 2021 14:21:59 -0600 Seth Forshee > wrote: > > > Currently there seems to be an assumption in tmpfs that 64-bit > > architectures also have a 64-bit ino_t. This is not true; s390 at > > least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs > > mounts will get 64-bit inode numbers and display "inode64" in the > > mount options, but passing the "inode64" mount option will fail. > > This leads to the following behavior: > > > > # mkdir mnt > > # mount -t tmpfs nodev mnt > > # mount -o remount,rw mnt > > mount: /home/ubuntu/mnt: mount point not mounted or bad option. > > > > As mount sees "inode64" in the mount options and thus passes it > > in the options for the remount. > > > > Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8, > > but I don't think it's possible to test for this (potentially > > CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm > > not sure whether or not that is wanted). So fix this by simply > > refusing to honor the CONFIG_TMPFS_INODE64 setting when > > sizeof(ino_t) < 8. > > How about changing s390 Kconfig so that CONFIG_TMPFS_INODE64 is not enabled? I did do that for our config. I see the s390 defconfig has it enabled, so I will send a patch for that too. But the fact that it can be configured that way and that the code behaves badly still seems problematic. > > > Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb") > > Signed-off-by: Seth Forshee > > With a cc:stable, I assume? Yes, I forgot that. Thanks. Seth
Re: [PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t
On Fri, 5 Feb 2021 14:21:59 -0600 Seth Forshee wrote: > Currently there seems to be an assumption in tmpfs that 64-bit > architectures also have a 64-bit ino_t. This is not true; s390 at > least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs > mounts will get 64-bit inode numbers and display "inode64" in the > mount options, but passing the "inode64" mount option will fail. > This leads to the following behavior: > > # mkdir mnt > # mount -t tmpfs nodev mnt > # mount -o remount,rw mnt > mount: /home/ubuntu/mnt: mount point not mounted or bad option. > > As mount sees "inode64" in the mount options and thus passes it > in the options for the remount. > > Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8, > but I don't think it's possible to test for this (potentially > CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm > not sure whether or not that is wanted). So fix this by simply > refusing to honor the CONFIG_TMPFS_INODE64 setting when > sizeof(ino_t) < 8. How about changing s390 Kconfig so that CONFIG_TMPFS_INODE64 is not enabled? > Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb") > Signed-off-by: Seth Forshee With a cc:stable, I assume?
[PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t
Currently there seems to be an assumption in tmpfs that 64-bit architectures also have a 64-bit ino_t. This is not true; s390 at least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers and display "inode64" in the mount options, but passing the "inode64" mount option will fail. This leads to the following behavior: # mkdir mnt # mount -t tmpfs nodev mnt # mount -o remount,rw mnt mount: /home/ubuntu/mnt: mount point not mounted or bad option. As mount sees "inode64" in the mount options and thus passes it in the options for the remount. Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8, but I don't think it's possible to test for this (potentially CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm not sure whether or not that is wanted). So fix this by simply refusing to honor the CONFIG_TMPFS_INODE64 setting when sizeof(ino_t) < 8. Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb") Signed-off-by: Seth Forshee --- mm/shmem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/shmem.c b/mm/shmem.c index 7c6b6d8f6c39..efde42acdc7a 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3733,7 +3733,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc) ctx->blocks = shmem_default_max_blocks(); if (!(ctx->seen & SHMEM_SEEN_INODES)) ctx->inodes = shmem_default_max_inodes(); - if (!(ctx->seen & SHMEM_SEEN_INUMS)) + if (!(ctx->seen & SHMEM_SEEN_INUMS) && sizeof(ino_t) >= 8) ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64); } else { sb->s_flags |= SB_NOUSER; -- 2.29.2
Re: [PATCH] ecryptfs: Fix inodes not being evicted until unmount
On Sat, Jan 30, 2021 at 11:06:40AM -0600, Tyler Hicks wrote: > Hey Jeffrey and Dan - thanks for the patch! Unfortunately, I think this > would allow the eCryptfs inode's nlink count to get out of sync with the > lower inode's nlink count in the case of direct manipulation to the > lower filesystem. Hmm. What if I instead synchronize it before calling vfs_unlink(), then call drop_nlink() if vfs_unlink() succeeds? > Is the condition that you're trying to fix a result of going through the > this code path? > > ecryptfs_unlink() -> ecryptfs_do_unlink() -> vfs_unlink() -> nfs_unlink() -> > nfs_sillyrename() -> nfs_async_unlink() Yes, that is the code path that causes it. V/R, Jeffrey Mitchell
Re: [PATCH] ecryptfs: Fix inodes not being evicted until unmount
On 2020-12-18 13:07:30, Jeffrey Mitchell wrote: > On asynchronous base filesystems like NFS, eCryptFS leaves inodes for > deleted files in the cache until unmounting. Change call in > ecryptfs_do_unlink() from set_nlink() to drop_nlink() in order to reliably > evict inodes from the cache even on top of NFS. > > Signed-off-by: Dan Robertson > Signed-off-by: Jeffrey Mitchell Hey Jeffrey and Dan - thanks for the patch! Unfortunately, I think this would allow the eCryptfs inode's nlink count to get out of sync with the lower inode's nlink count in the case of direct manipulation to the lower filesystem. Is the condition that you're trying to fix a result of going through the this code path? ecryptfs_unlink() -> ecryptfs_do_unlink() -> vfs_unlink() -> nfs_unlink() -> nfs_sillyrename() -> nfs_async_unlink() Tyler > --- > fs/ecryptfs/inode.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c > index e23752d..f7594b6 100644 > --- a/fs/ecryptfs/inode.c > +++ b/fs/ecryptfs/inode.c > @@ -147,7 +147,7 @@ static int ecryptfs_do_unlink(struct inode *dir, struct > dentry *dentry, > goto out_unlock; > } > fsstack_copy_attr_times(dir, lower_dir_inode); > - set_nlink(inode, ecryptfs_inode_to_lower(inode)->i_nlink); > + drop_nlink(inode); > inode->i_ctime = dir->i_ctime; > out_unlock: > dput(lower_dentry); > -- > 2.7.4 >
[PATCH 5.10 012/103] btrfs: shrink delalloc pages instead of full inodes
From: Josef Bacik [ Upstream commit e076ab2a2ca70a0270232067cd49f76cd92efe64 ] Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing some infrastructure we have in place to flush inodes that we use for device replace and snapshot. However this introduced a pretty serious performance regression. To reproduce the user untarred the source tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would see it take anywhere from 5 to 20 times as long to untar in 5.10 compared to 5.9. This was observed on fast devices (SSD and better) and not on HDD. The root cause is because before we would generally use the normal writeback path to reclaim delalloc space, and for this we would provide it with the number of pages we wanted to flush. The referenced commit changed this to flush that many inodes, which drastically increased the amount of space we were flushing in certain cases, which severely affected performance. We cannot revert this patch unfortunately because of 3d45f221ce62 ("btrfs: fix deadlock when cloning inline extent and low on free metadata space") which requires the ability to skip flushing inodes that are being cloned in certain scenarios, which means we need to keep using our flushing infrastructure or risk re-introducing the deadlock. Instead to fix this problem we can go back to providing btrfs_start_delalloc_roots with a number of pages to flush, and then set up a writeback_control and utilize sync_inode() to handle the flushing for us. This gives us the same behavior we had prior to the fix, while still allowing us to avoid the deadlock that was fixed by Filipe. I redid the users original test and got the following results on one of our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive) 5.9 0m54.258s 5.10 1m26.212s 5.10+patch0m38.800s 5.10+patch is significantly faster than plain 5.9 because of my patch series "Change data reservations to use the ticketing infra" which contained the patch that introduced the regression, but generally improved the overall ENOSPC flushing mechanisms. Additional testing on consumer-grade SSD (8GiB ram, 8 CPU) confirm the results: 5.10.54m00s 5.10.5+patch 1m08s 5.11-rc2 5m14s 5.11-rc2+patch1m30s Reported-by: René Rebe Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc") CC: sta...@vger.kernel.org # 5.10 Signed-off-by: Josef Bacik Tested-by: David Sterba Reviewed-by: David Sterba [ add my test results ] Signed-off-by: David Sterba Signed-off-by: Sasha Levin --- fs/btrfs/inode.c | 60 +++ fs/btrfs/space-info.c | 4 ++- 2 files changed, 46 insertions(+), 18 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 07479250221da..acc47e2ffb46b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -9389,7 +9389,8 @@ static struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode * some fairly slow code that needs optimization. This walks the list * of all the inodes with pending delalloc and forces them to disk. */ -static int start_delalloc_inodes(struct btrfs_root *root, u64 *nr, bool snapshot, +static int start_delalloc_inodes(struct btrfs_root *root, +struct writeback_control *wbc, bool snapshot, bool in_reclaim_context) { struct btrfs_inode *binode; @@ -9398,6 +9399,7 @@ static int start_delalloc_inodes(struct btrfs_root *root, u64 *nr, bool snapshot struct list_head works; struct list_head splice; int ret = 0; + bool full_flush = wbc->nr_to_write == LONG_MAX; INIT_LIST_HEAD(); INIT_LIST_HEAD(); @@ -9426,18 +9428,24 @@ static int start_delalloc_inodes(struct btrfs_root *root, u64 *nr, bool snapshot if (snapshot) set_bit(BTRFS_INODE_SNAPSHOT_FLUSH, >runtime_flags); - work = btrfs_alloc_delalloc_work(inode); - if (!work) { - iput(inode); - ret = -ENOMEM; - goto out; - } - list_add_tail(>list, ); - btrfs_queue_work(root->fs_info->flush_workers, ->work); - if (*nr != U64_MAX) { - (*nr)--; - if (*nr == 0) + if (full_flush) { + work = btrfs_alloc_delalloc_work(inode); + if (!work) { + iput(inode); + ret = -ENOMEM; + goto out; + } + list_add_tail(>list, ); + btrfs_queue_work(root->fs_inf
Re: [PATCH v15 0/4] SELinux support for anonymous inodes and UFFD
On Thu, Jan 14, 2021 at 2:47 PM Paul Moore wrote: > > On Tue, Jan 12, 2021 at 12:15 PM Paul Moore wrote: > > > > On Fri, Jan 8, 2021 at 5:22 PM Lokesh Gidra wrote: > > > > > > Userfaultfd in unprivileged contexts could be potentially very > > > useful. We'd like to harden userfaultfd to make such unprivileged use > > > less risky. This patch series allows SELinux to manage userfaultfd > > > file descriptors and in the future, other kinds of > > > anonymous-inode-based file descriptor. > > > > ... > > > > > Daniel Colascione (3): > > > fs: add LSM-supporting anon-inode interface > > > selinux: teach SELinux about anonymous inodes > > > userfaultfd: use secure anon inodes for userfaultfd > > > > > > Lokesh Gidra (1): > > > security: add inode_init_security_anon() LSM hook > > > > > > fs/anon_inodes.c| 150 > > > fs/libfs.c | 5 - > > > fs/userfaultfd.c| 19 ++-- > > > include/linux/anon_inodes.h | 5 + > > > include/linux/lsm_hook_defs.h | 2 + > > > include/linux/lsm_hooks.h | 9 ++ > > > include/linux/security.h| 10 ++ > > > security/security.c | 8 ++ > > > security/selinux/hooks.c| 57 +++ > > > security/selinux/include/classmap.h | 2 + > > > 10 files changed, 213 insertions(+), 54 deletions(-) > > > > With several rounds of reviews done and the corresponding SELinux test > > suite looking close to being ready I think it makes sense to merge > > this via the SELinux tree. VFS folks, if you have any comments or > > objections please let me know soon. If I don't hear anything within > > the next day or two I'll go ahead and merge this for linux-next. > > With no comments over the last two days I merged the patchset into > selinux/next. Thanks for all your work and patience on this Lokesh. > Thanks so much. > Also, it looks like you are very close to getting the associated > SELinux test suite additions merged, please continue to work with > Ondrej to get those merged soon. > Certainly! I'm waiting for his reviews for the latest patch. > -- > paul moore > www.paul-moore.com
Re: [PATCH v15 0/4] SELinux support for anonymous inodes and UFFD
On Tue, Jan 12, 2021 at 12:15 PM Paul Moore wrote: > > On Fri, Jan 8, 2021 at 5:22 PM Lokesh Gidra wrote: > > > > Userfaultfd in unprivileged contexts could be potentially very > > useful. We'd like to harden userfaultfd to make such unprivileged use > > less risky. This patch series allows SELinux to manage userfaultfd > > file descriptors and in the future, other kinds of > > anonymous-inode-based file descriptor. > > ... > > > Daniel Colascione (3): > > fs: add LSM-supporting anon-inode interface > > selinux: teach SELinux about anonymous inodes > > userfaultfd: use secure anon inodes for userfaultfd > > > > Lokesh Gidra (1): > > security: add inode_init_security_anon() LSM hook > > > > fs/anon_inodes.c| 150 > > fs/libfs.c | 5 - > > fs/userfaultfd.c| 19 ++-- > > include/linux/anon_inodes.h | 5 + > > include/linux/lsm_hook_defs.h | 2 + > > include/linux/lsm_hooks.h | 9 ++ > > include/linux/security.h| 10 ++ > > security/security.c | 8 ++ > > security/selinux/hooks.c| 57 +++ > > security/selinux/include/classmap.h | 2 + > > 10 files changed, 213 insertions(+), 54 deletions(-) > > With several rounds of reviews done and the corresponding SELinux test > suite looking close to being ready I think it makes sense to merge > this via the SELinux tree. VFS folks, if you have any comments or > objections please let me know soon. If I don't hear anything within > the next day or two I'll go ahead and merge this for linux-next. With no comments over the last two days I merged the patchset into selinux/next. Thanks for all your work and patience on this Lokesh. Also, it looks like you are very close to getting the associated SELinux test suite additions merged, please continue to work with Ondrej to get those merged soon. -- paul moore www.paul-moore.com
Re: [PATCH v15 0/4] SELinux support for anonymous inodes and UFFD
On Fri, Jan 8, 2021 at 5:22 PM Lokesh Gidra wrote: > > Userfaultfd in unprivileged contexts could be potentially very > useful. We'd like to harden userfaultfd to make such unprivileged use > less risky. This patch series allows SELinux to manage userfaultfd > file descriptors and in the future, other kinds of > anonymous-inode-based file descriptor. ... > Daniel Colascione (3): > fs: add LSM-supporting anon-inode interface > selinux: teach SELinux about anonymous inodes > userfaultfd: use secure anon inodes for userfaultfd > > Lokesh Gidra (1): > security: add inode_init_security_anon() LSM hook > > fs/anon_inodes.c| 150 > fs/libfs.c | 5 - > fs/userfaultfd.c| 19 ++-- > include/linux/anon_inodes.h | 5 + > include/linux/lsm_hook_defs.h | 2 + > include/linux/lsm_hooks.h | 9 ++ > include/linux/security.h| 10 ++ > security/security.c | 8 ++ > security/selinux/hooks.c| 57 +++ > security/selinux/include/classmap.h | 2 + > 10 files changed, 213 insertions(+), 54 deletions(-) With several rounds of reviews done and the corresponding SELinux test suite looking close to being ready I think it makes sense to merge this via the SELinux tree. VFS folks, if you have any comments or objections please let me know soon. If I don't hear anything within the next day or two I'll go ahead and merge this for linux-next. Thanks. -- paul moore www.paul-moore.com
[PATCH v15 4/4] userfaultfd: use secure anon inodes for userfaultfd
From: Daniel Colascione This change gives userfaultfd file descriptors a real security context, allowing policy to act on them. Signed-off-by: Daniel Colascione [LG: Remove owner inode from userfaultfd_ctx] [LG: Use anon_inode_getfd_secure() instead of anon_inode_getfile_secure() in userfaultfd syscall] [LG: Use inode of file in userfaultfd_read() in resolve_userfault_fork()] Signed-off-by: Lokesh Gidra Reviewed-by: Eric Biggers --- fs/userfaultfd.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 894cc28142e7..0be8cdd4425a 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -979,14 +979,14 @@ static __poll_t userfaultfd_poll(struct file *file, poll_table *wait) static const struct file_operations userfaultfd_fops; -static int resolve_userfault_fork(struct userfaultfd_ctx *ctx, - struct userfaultfd_ctx *new, +static int resolve_userfault_fork(struct userfaultfd_ctx *new, + struct inode *inode, struct uffd_msg *msg) { int fd; - fd = anon_inode_getfd("[userfaultfd]", _fops, new, - O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS)); + fd = anon_inode_getfd_secure("[userfaultfd]", _fops, new, + O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS), inode); if (fd < 0) return fd; @@ -996,7 +996,7 @@ static int resolve_userfault_fork(struct userfaultfd_ctx *ctx, } static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait, - struct uffd_msg *msg) + struct uffd_msg *msg, struct inode *inode) { ssize_t ret; DECLARE_WAITQUEUE(wait, current); @@ -1107,7 +1107,7 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait, spin_unlock_irq(>fd_wqh.lock); if (!ret && msg->event == UFFD_EVENT_FORK) { - ret = resolve_userfault_fork(ctx, fork_nctx, msg); + ret = resolve_userfault_fork(fork_nctx, inode, msg); spin_lock_irq(>event_wqh.lock); if (!list_empty(_event)) { /* @@ -1167,6 +1167,7 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf, ssize_t _ret, ret = 0; struct uffd_msg msg; int no_wait = file->f_flags & O_NONBLOCK; + struct inode *inode = file_inode(file); if (ctx->state == UFFD_STATE_WAIT_API) return -EINVAL; @@ -1174,7 +1175,7 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf, for (;;) { if (count < sizeof(msg)) return ret ? ret : -EINVAL; - _ret = userfaultfd_ctx_read(ctx, no_wait, ); + _ret = userfaultfd_ctx_read(ctx, no_wait, , inode); if (_ret < 0) return ret ? ret : _ret; if (copy_to_user((__u64 __user *) buf, , sizeof(msg))) @@ -1999,8 +2000,8 @@ SYSCALL_DEFINE1(userfaultfd, int, flags) /* prevent the mm struct to be freed */ mmgrab(ctx->mm); - fd = anon_inode_getfd("[userfaultfd]", _fops, ctx, - O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS)); + fd = anon_inode_getfd_secure("[userfaultfd]", _fops, ctx, + O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS), NULL); if (fd < 0) { mmdrop(ctx->mm); kmem_cache_free(userfaultfd_ctx_cachep, ctx); -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH v15 0/4] SELinux support for anonymous inodes and UFFD
Userfaultfd in unprivileged contexts could be potentially very useful. We'd like to harden userfaultfd to make such unprivileged use less risky. This patch series allows SELinux to manage userfaultfd file descriptors and in the future, other kinds of anonymous-inode-based file descriptor. SELinux policy authors can apply policy types to anonymous inodes by providing name-based transition rules keyed off the anonymous inode internal name ( "[userfaultfd]" in the case of userfaultfd(2) file descriptors) and applying policy to the new SIDs thus produced. With SELinux managed userfaultfd, an admin can control creation and movement of the file descriptors. In particular, handling of a userfaultfd descriptor by a different process is essentially a ptrace access into the process, without any of the corresponding security_ptrace_access_check() checks. For privacy, the admin may want to deny such accesses, which is possible with SELinux support. Inside the kernel, a new anon_inode interface, anon_inode_getfd_secure, allows callers to opt into this SELinux management. In this new "secure" mode, anon_inodes create new ephemeral inodes for anonymous file objects instead of reusing the normal anon_inodes singleton dummy inode. A new LSM hook gives security modules an opportunity to configure and veto these ephemeral inodes. This patch series is one of two fork of [1] and is an alternative to [2]. The primary difference between the two patch series is that this partch series creates a unique inode for each "secure" anonymous inode, while the other patch series ([2]) continues using the singleton dummy anonymous inode and adds a way to attach SELinux security information directly to file objects. I prefer the approach in this patch series because 1) it's a smaller patch than [2], and 2) it produces a more regular security architecture: in this patch series, secure anonymous inodes aren't S_PRIVATE and they maintain the SELinux property that the label for a file is in its inode. We do need an additional inode per anonymous file, but per-struct-file inode creation doesn't seem to be a problem for pipes and sockets. The previous version of this feature ([1]) created a new SELinux security class for userfaultfd file descriptors. This version adopts the generic transition-based approach of [2]. This patch series also differs from [2] in that it doesn't affect all anonymous inodes right away --- instead requiring anon_inodes callers to opt in --- but this difference isn't one of basic approach. The important question to resolve is whether we should be creating new inodes or enhancing per-file data. Changes from the first version of the patch: - Removed some error checks - Defined a new anon_inode SELinux class to resolve the ambiguity in [3] - Inherit sclass as well as descriptor from context inode Changes from the second version of the patch: - Fixed example policy in the commit message to reflect the use of the new anon_inode class. Changes from the third version of the patch: - Dropped the fops parameter to the LSM hook - Documented hook parameters - Fixed incorrect class used for SELinux transition - Removed stray UFFD changed early in the series - Removed a redundant ERR_PTR(PTR_ERR()) Changes from the fourth version of the patch: - Removed an unused parameter from an internal function - Fixed function documentation Changes from the fifth version of the patch: - Fixed function documentation in fs/anon_inodes.c and include/linux/lsm_hooks.h - Used anon_inode_getfd_secure() in userfaultfd() syscall and removed owner from userfaultfd_ctx. Changes from the sixth version of the patch: - Removed definition of anon_inode_getfile_secure() as there are no callers. - Simplified function description of anon_inode_getfd_secure(). - Elaborated more on the purpose of 'context_inode' in commit message. Changes from the seventh version of the patch: - Fixed error handling in _anon_inode_getfile(). - Fixed minor comment and indentation related issues. Changes from the eighth version of the patch: - Replaced selinux_state.initialized with selinux_state.initialized Changes from the ninth version of the patch: - Fixed function names in fs/anon_inodes.c - Fixed comment of anon_inode_getfd_secure() - Fixed name of the patch wherein userfaultfd code uses anon_inode_getfd_secure() Changes from the tenth version of the patch: - Split first patch into VFS and LSM specific patches - Fixed comments in fs/anon_inodes.c - Fixed comment of alloc_anon_inode() Changes from the eleventh version of the patch: - Removed comment of alloc_anon_inode() for consistency with the code - Fixed explanation of LSM hook in the commit message Changes from the twelfth version of the patch: - Replaced FILE__CREATE with ANON_INODE__CREATE while initializing anon-inode's SELinux security struct. - Check context_inode's SE
[PATCH v15 3/4] selinux: teach SELinux about anonymous inodes
From: Daniel Colascione This change uses the anon_inodes and LSM infrastructure introduced in the previous patches to give SELinux the ability to control anonymous-inode files that are created using the new anon_inode_getfd_secure() function. A SELinux policy author detects and controls these anonymous inodes by adding a name-based type_transition rule that assigns a new security type to anonymous-inode files created in some domain. The name used for the name-based transition is the name associated with the anonymous inode for file listings --- e.g., "[userfaultfd]" or "[perf_event]". Example: type uffd_t; type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]"; allow sysadm_t uffd_t:anon_inode { create }; (The next patch in this series is necessary for making userfaultfd support this new interface. The example above is just for exposition.) Signed-off-by: Daniel Colascione Signed-off-by: Lokesh Gidra --- security/selinux/hooks.c| 57 + security/selinux/include/classmap.h | 2 + 2 files changed, 59 insertions(+) diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 644b17ec9e63..a5e12b2fabde 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -2934,6 +2934,62 @@ static int selinux_inode_init_security(struct inode *inode, struct inode *dir, return 0; } +static int selinux_inode_init_security_anon(struct inode *inode, + const struct qstr *name, + const struct inode *context_inode) +{ + const struct task_security_struct *tsec = selinux_cred(current_cred()); + struct common_audit_data ad; + struct inode_security_struct *isec; + int rc; + + if (unlikely(!selinux_initialized(_state))) + return 0; + + isec = selinux_inode(inode); + + /* +* We only get here once per ephemeral inode. The inode has +* been initialized via inode_alloc_security but is otherwise +* untouched. +*/ + + if (context_inode) { + struct inode_security_struct *context_isec = + selinux_inode(context_inode); + if (context_isec->initialized != LABEL_INITIALIZED) { + pr_err("SELinux: context_inode is not initialized"); + return -EACCES; + } + + isec->sclass = context_isec->sclass; + isec->sid = context_isec->sid; + } else { + isec->sclass = SECCLASS_ANON_INODE; + rc = security_transition_sid( + _state, tsec->sid, tsec->sid, + isec->sclass, name, >sid); + if (rc) + return rc; + } + + isec->initialized = LABEL_INITIALIZED; + /* +* Now that we've initialized security, check whether we're +* allowed to actually create this type of anonymous inode. +*/ + + ad.type = LSM_AUDIT_DATA_INODE; + ad.u.inode = inode; + + return avc_has_perm(_state, + tsec->sid, + isec->sid, + isec->sclass, + FILE__CREATE, + ); +} + static int selinux_inode_create(struct inode *dir, struct dentry *dentry, umode_t mode) { return may_create(dir, dentry, SECCLASS_FILE); @@ -7000,6 +7056,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = { LSM_HOOK_INIT(inode_free_security, selinux_inode_free_security), LSM_HOOK_INIT(inode_init_security, selinux_inode_init_security), + LSM_HOOK_INIT(inode_init_security_anon, selinux_inode_init_security_anon), LSM_HOOK_INIT(inode_create, selinux_inode_create), LSM_HOOK_INIT(inode_link, selinux_inode_link), LSM_HOOK_INIT(inode_unlink, selinux_inode_unlink), diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h index 40cebde62856..ba2e01a6955c 100644 --- a/security/selinux/include/classmap.h +++ b/security/selinux/include/classmap.h @@ -249,6 +249,8 @@ struct security_class_mapping secclass_map[] = { {"open", "cpu", "kernel", "tracepoint", "read", "write"} }, { "lockdown", { "integrity", "confidentiality", NULL } }, + { "anon_inode", + { COMMON_FILE_PERMS, NULL } }, { NULL } }; -- 2.30.0.284.gd98b1dd5eaa7-goog
Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes
On Fri, Jan 8, 2021 at 1:24 PM Stephen Smalley wrote: > > On Fri, Jan 8, 2021 at 3:17 PM Lokesh Gidra wrote: > > > > On Fri, Jan 8, 2021 at 11:35 AM Stephen Smalley > > wrote: > > > > > > On Wed, Jan 6, 2021 at 10:03 PM Paul Moore wrote: > > > > > > > > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra > > > > wrote: > > > > > From: Daniel Colascione > > > > > > > > > > This change uses the anon_inodes and LSM infrastructure introduced in > > > > > the previous patches to give SELinux the ability to control > > > > > anonymous-inode files that are created using the new > > > > > anon_inode_getfd_secure() function. > > > > > > > > > > A SELinux policy author detects and controls these anonymous inodes by > > > > > adding a name-based type_transition rule that assigns a new security > > > > > type to anonymous-inode files created in some domain. The name used > > > > > for the name-based transition is the name associated with the > > > > > anonymous inode for file listings --- e.g., "[userfaultfd]" or > > > > > "[perf_event]". > > > > > > > > > > Example: > > > > > > > > > > type uffd_t; > > > > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]"; > > > > > allow sysadm_t uffd_t:anon_inode { create }; > > > > > > > > > > (The next patch in this series is necessary for making userfaultfd > > > > > support this new interface. The example above is just > > > > > for exposition.) > > > > > > > > > > Signed-off-by: Daniel Colascione > > > > > Signed-off-by: Lokesh Gidra > > > > > --- > > > > > security/selinux/hooks.c| 56 > > > > > + > > > > > security/selinux/include/classmap.h | 2 ++ > > > > > 2 files changed, 58 insertions(+) > > > > > > > > > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > > > > > index 6b1826fc3658..d092aa512868 100644 > > > > > --- a/security/selinux/hooks.c > > > > > +++ b/security/selinux/hooks.c > > > > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct > > > > > inode *inode, struct inode *dir, > > > > > return 0; > > > > > } > > > > > > > > > > +static int selinux_inode_init_security_anon(struct inode *inode, > > > > > + const struct qstr *name, > > > > > + const struct inode > > > > > *context_inode) > > > > > +{ > > > > > + const struct task_security_struct *tsec = > > > > > selinux_cred(current_cred()); > > > > > + struct common_audit_data ad; > > > > > + struct inode_security_struct *isec; > > > > > + int rc; > > > > > + > > > > > + if (unlikely(!selinux_initialized(_state))) > > > > > + return 0; > > > > > + > > > > > + isec = selinux_inode(inode); > > > > > + > > > > > + /* > > > > > +* We only get here once per ephemeral inode. The inode has > > > > > +* been initialized via inode_alloc_security but is otherwise > > > > > +* untouched. > > > > > +*/ > > > > > + > > > > > + if (context_inode) { > > > > > + struct inode_security_struct *context_isec = > > > > > + selinux_inode(context_inode); > > > > > + if (context_isec->initialized != LABEL_INITIALIZED) > > > > > + return -EACCES; > > Stephen, as per your explanation below, is this check also > > problematic? I mean is it possible that /dev/kvm context_inode may not > > have its label initialized? If so, then v12 of the patch series can be > > used as is. Otherwise, I will send the next version which rollbacks > > v14 and v13, except for this check. Kindly confirm. > > The context_inode should always be initialized already. I'm not fond > though of silently returning -EACCES here. At the least we should > have a pr_err() or pr_warn() here. In reality, this could only occur > in the case of a kernel bug or memory corruption so it used to be a > candidate for WARN_ON() or BUG_ON() or similar but I know that > BUG_ON() at least is frowned upon these days. Got it. I'll add a pr_err(). Thanks a lot.
Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes
On Fri, Jan 8, 2021 at 3:17 PM Lokesh Gidra wrote: > > On Fri, Jan 8, 2021 at 11:35 AM Stephen Smalley > wrote: > > > > On Wed, Jan 6, 2021 at 10:03 PM Paul Moore wrote: > > > > > > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra > > > wrote: > > > > From: Daniel Colascione > > > > > > > > This change uses the anon_inodes and LSM infrastructure introduced in > > > > the previous patches to give SELinux the ability to control > > > > anonymous-inode files that are created using the new > > > > anon_inode_getfd_secure() function. > > > > > > > > A SELinux policy author detects and controls these anonymous inodes by > > > > adding a name-based type_transition rule that assigns a new security > > > > type to anonymous-inode files created in some domain. The name used > > > > for the name-based transition is the name associated with the > > > > anonymous inode for file listings --- e.g., "[userfaultfd]" or > > > > "[perf_event]". > > > > > > > > Example: > > > > > > > > type uffd_t; > > > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]"; > > > > allow sysadm_t uffd_t:anon_inode { create }; > > > > > > > > (The next patch in this series is necessary for making userfaultfd > > > > support this new interface. The example above is just > > > > for exposition.) > > > > > > > > Signed-off-by: Daniel Colascione > > > > Signed-off-by: Lokesh Gidra > > > > --- > > > > security/selinux/hooks.c| 56 + > > > > security/selinux/include/classmap.h | 2 ++ > > > > 2 files changed, 58 insertions(+) > > > > > > > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > > > > index 6b1826fc3658..d092aa512868 100644 > > > > --- a/security/selinux/hooks.c > > > > +++ b/security/selinux/hooks.c > > > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct > > > > inode *inode, struct inode *dir, > > > > return 0; > > > > } > > > > > > > > +static int selinux_inode_init_security_anon(struct inode *inode, > > > > + const struct qstr *name, > > > > + const struct inode > > > > *context_inode) > > > > +{ > > > > + const struct task_security_struct *tsec = > > > > selinux_cred(current_cred()); > > > > + struct common_audit_data ad; > > > > + struct inode_security_struct *isec; > > > > + int rc; > > > > + > > > > + if (unlikely(!selinux_initialized(_state))) > > > > + return 0; > > > > + > > > > + isec = selinux_inode(inode); > > > > + > > > > + /* > > > > +* We only get here once per ephemeral inode. The inode has > > > > +* been initialized via inode_alloc_security but is otherwise > > > > +* untouched. > > > > +*/ > > > > + > > > > + if (context_inode) { > > > > + struct inode_security_struct *context_isec = > > > > + selinux_inode(context_inode); > > > > + if (context_isec->initialized != LABEL_INITIALIZED) > > > > + return -EACCES; > Stephen, as per your explanation below, is this check also > problematic? I mean is it possible that /dev/kvm context_inode may not > have its label initialized? If so, then v12 of the patch series can be > used as is. Otherwise, I will send the next version which rollbacks > v14 and v13, except for this check. Kindly confirm. The context_inode should always be initialized already. I'm not fond though of silently returning -EACCES here. At the least we should have a pr_err() or pr_warn() here. In reality, this could only occur in the case of a kernel bug or memory corruption so it used to be a candidate for WARN_ON() or BUG_ON() or similar but I know that BUG_ON() at least is frowned upon these days.
Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes
On Fri, Jan 8, 2021 at 2:35 PM Stephen Smalley wrote: > On Wed, Jan 6, 2021 at 10:03 PM Paul Moore wrote: > > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra wrote: > > > From: Daniel Colascione > > > > > > This change uses the anon_inodes and LSM infrastructure introduced in > > > the previous patches to give SELinux the ability to control > > > anonymous-inode files that are created using the new > > > anon_inode_getfd_secure() function. > > > > > > A SELinux policy author detects and controls these anonymous inodes by > > > adding a name-based type_transition rule that assigns a new security > > > type to anonymous-inode files created in some domain. The name used > > > for the name-based transition is the name associated with the > > > anonymous inode for file listings --- e.g., "[userfaultfd]" or > > > "[perf_event]". > > > > > > Example: > > > > > > type uffd_t; > > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]"; > > > allow sysadm_t uffd_t:anon_inode { create }; > > > > > > (The next patch in this series is necessary for making userfaultfd > > > support this new interface. The example above is just > > > for exposition.) > > > > > > Signed-off-by: Daniel Colascione > > > Signed-off-by: Lokesh Gidra > > > --- > > > security/selinux/hooks.c| 56 + > > > security/selinux/include/classmap.h | 2 ++ > > > 2 files changed, 58 insertions(+) > > > > > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > > > index 6b1826fc3658..d092aa512868 100644 > > > --- a/security/selinux/hooks.c > > > +++ b/security/selinux/hooks.c > > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct > > > inode *inode, struct inode *dir, > > > return 0; > > > } > > > > > > +static int selinux_inode_init_security_anon(struct inode *inode, > > > + const struct qstr *name, > > > + const struct inode > > > *context_inode) > > > +{ > > > + const struct task_security_struct *tsec = > > > selinux_cred(current_cred()); > > > + struct common_audit_data ad; > > > + struct inode_security_struct *isec; > > > + int rc; > > > + > > > + if (unlikely(!selinux_initialized(_state))) > > > + return 0; > > > + > > > + isec = selinux_inode(inode); > > > + > > > + /* > > > +* We only get here once per ephemeral inode. The inode has > > > +* been initialized via inode_alloc_security but is otherwise > > > +* untouched. > > > +*/ > > > + > > > + if (context_inode) { > > > + struct inode_security_struct *context_isec = > > > + selinux_inode(context_inode); > > > + if (context_isec->initialized != LABEL_INITIALIZED) > > > + return -EACCES; > > > + > > > + isec->sclass = context_isec->sclass; > > > > Taking the object class directly from the context_inode is > > interesting, and I suspect problematic. In the case below where no > > context_inode is supplied the object class is set to > > SECCLASS_ANON_INODE, which is correct, but when a context_inode is > > supplied there is no guarantee that the object class will be set to > > SECCLASS_ANON_INODE. This could both pose a problem for policy > > writers (how do you distinguish the anon inode from other normal file > > inodes in this case?) as well as an outright fault later in this > > function when we try to check the ANON_INODE__CREATE on an object > > other than a SECCLASS_ANON_INODE object. > > > > It works in the userfaultfd case because the context_inode is > > originally created with this function so the object class is correctly > > set to SECCLASS_ANON_INODE, but can we always guarantee that to be the > > case? Do we ever need or want to support using a context_inode that > > is not SECCLASS_ANON_INODE? > > Sorry, I haven't been following this. IIRC, the original reason for > passing a context_inode was to support the /dev/kvm or similar use > cases where the driver is creating anonymous inodes to represent > specific objects/interfaces derived from
Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes
On Fri, Jan 8, 2021 at 11:35 AM Stephen Smalley wrote: > > On Wed, Jan 6, 2021 at 10:03 PM Paul Moore wrote: > > > > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra wrote: > > > From: Daniel Colascione > > > > > > This change uses the anon_inodes and LSM infrastructure introduced in > > > the previous patches to give SELinux the ability to control > > > anonymous-inode files that are created using the new > > > anon_inode_getfd_secure() function. > > > > > > A SELinux policy author detects and controls these anonymous inodes by > > > adding a name-based type_transition rule that assigns a new security > > > type to anonymous-inode files created in some domain. The name used > > > for the name-based transition is the name associated with the > > > anonymous inode for file listings --- e.g., "[userfaultfd]" or > > > "[perf_event]". > > > > > > Example: > > > > > > type uffd_t; > > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]"; > > > allow sysadm_t uffd_t:anon_inode { create }; > > > > > > (The next patch in this series is necessary for making userfaultfd > > > support this new interface. The example above is just > > > for exposition.) > > > > > > Signed-off-by: Daniel Colascione > > > Signed-off-by: Lokesh Gidra > > > --- > > > security/selinux/hooks.c| 56 + > > > security/selinux/include/classmap.h | 2 ++ > > > 2 files changed, 58 insertions(+) > > > > > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > > > index 6b1826fc3658..d092aa512868 100644 > > > --- a/security/selinux/hooks.c > > > +++ b/security/selinux/hooks.c > > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct > > > inode *inode, struct inode *dir, > > > return 0; > > > } > > > > > > +static int selinux_inode_init_security_anon(struct inode *inode, > > > + const struct qstr *name, > > > + const struct inode > > > *context_inode) > > > +{ > > > + const struct task_security_struct *tsec = > > > selinux_cred(current_cred()); > > > + struct common_audit_data ad; > > > + struct inode_security_struct *isec; > > > + int rc; > > > + > > > + if (unlikely(!selinux_initialized(_state))) > > > + return 0; > > > + > > > + isec = selinux_inode(inode); > > > + > > > + /* > > > +* We only get here once per ephemeral inode. The inode has > > > +* been initialized via inode_alloc_security but is otherwise > > > +* untouched. > > > +*/ > > > + > > > + if (context_inode) { > > > + struct inode_security_struct *context_isec = > > > + selinux_inode(context_inode); > > > + if (context_isec->initialized != LABEL_INITIALIZED) > > > + return -EACCES; Stephen, as per your explanation below, is this check also problematic? I mean is it possible that /dev/kvm context_inode may not have its label initialized? If so, then v12 of the patch series can be used as is. Otherwise, I will send the next version which rollbacks v14 and v13, except for this check. Kindly confirm. > > > + > > > + isec->sclass = context_isec->sclass; > > > > Taking the object class directly from the context_inode is > > interesting, and I suspect problematic. In the case below where no > > context_inode is supplied the object class is set to > > SECCLASS_ANON_INODE, which is correct, but when a context_inode is > > supplied there is no guarantee that the object class will be set to > > SECCLASS_ANON_INODE. This could both pose a problem for policy > > writers (how do you distinguish the anon inode from other normal file > > inodes in this case?) as well as an outright fault later in this > > function when we try to check the ANON_INODE__CREATE on an object > > other than a SECCLASS_ANON_INODE object. > > > > It works in the userfaultfd case because the context_inode is > > originally created with this function so the object class is correctly > > set to SECCLASS_ANON_INODE, but can we always guarantee that to be the > > case? Do we ever need or want to
Re: [PATCH v14 3/4] selinux: teach SELinux about anonymous inodes
On Fri, Jan 8, 2021 at 12:33 AM Lokesh Gidra wrote: > > From: Daniel Colascione > > This change uses the anon_inodes and LSM infrastructure introduced in > the previous patches to give SELinux the ability to control > anonymous-inode files that are created using the new > anon_inode_getfd_secure() function. > > A SELinux policy author detects and controls these anonymous inodes by > adding a name-based type_transition rule that assigns a new security > type to anonymous-inode files created in some domain. The name used > for the name-based transition is the name associated with the > anonymous inode for file listings --- e.g., "[userfaultfd]" or > "[perf_event]". > > Example: > > type uffd_t; > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]"; > allow sysadm_t uffd_t:anon_inode { create }; > > (The next patch in this series is necessary for making userfaultfd > support this new interface. The example above is just > for exposition.) > > Signed-off-by: Daniel Colascione > Signed-off-by: Lokesh Gidra > --- > security/selinux/hooks.c| 59 + > security/selinux/include/classmap.h | 2 + > 2 files changed, 61 insertions(+) > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > index 644b17ec9e63..8b4e155b2930 100644 > --- a/security/selinux/hooks.c > +++ b/security/selinux/hooks.c > @@ -2934,6 +2933,63 @@ static int selinux_inode_init_security(struct inode > *inode, struct inode *dir, > return 0; > } > > +static int selinux_inode_init_security_anon(struct inode *inode, > + const struct qstr *name, > + const struct inode *context_inode) > +{ > + const struct task_security_struct *tsec = > selinux_cred(current_cred()); > + struct common_audit_data ad; > + struct inode_security_struct *isec; > + int rc; > + > + if (unlikely(!selinux_initialized(_state))) > + return 0; > + > + isec = selinux_inode(inode); > + > + /* > +* We only get here once per ephemeral inode. The inode has > +* been initialized via inode_alloc_security but is otherwise > +* untouched. > +*/ > + isec->initialized = LABEL_INITIALIZED; > + isec->sclass = SECCLASS_ANON_INODE; > + > + if (context_inode) { > + struct inode_security_struct *context_isec = > + selinux_inode(context_inode); > + if (context_isec->initialized != LABEL_INITIALIZED) > + return -EACCES; > + if (context_isec->sclass != SECCLASS_ANON_INODE) { > + pr_err("SELinux: initializing anonymous inode with > non-anonymous inode"); > + return -EACCES; > + } This would preclude using this facility for anonymous inodes created by kvm and other use cases. Don't do this. > + > + isec->sid = context_isec->sid; > + } else { > + rc = security_transition_sid( > + _state, tsec->sid, tsec->sid, > + isec->sclass, name, >sid); > + if (rc) > + return rc; > + } > + > + /* > +* Now that we've initialized security, check whether we're > +* allowed to actually create this type of anonymous inode. > +*/ > + > + ad.type = LSM_AUDIT_DATA_INODE; > + ad.u.inode = inode; > + > + return avc_has_perm(_state, > + tsec->sid, > + isec->sid, > + isec->sclass, > + ANON_INODE__CREATE, FILE__CREATE is perfectly appropriate here, not that it makes any difference.
Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes
On Wed, Jan 6, 2021 at 10:03 PM Paul Moore wrote: > > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra wrote: > > From: Daniel Colascione > > > > This change uses the anon_inodes and LSM infrastructure introduced in > > the previous patches to give SELinux the ability to control > > anonymous-inode files that are created using the new > > anon_inode_getfd_secure() function. > > > > A SELinux policy author detects and controls these anonymous inodes by > > adding a name-based type_transition rule that assigns a new security > > type to anonymous-inode files created in some domain. The name used > > for the name-based transition is the name associated with the > > anonymous inode for file listings --- e.g., "[userfaultfd]" or > > "[perf_event]". > > > > Example: > > > > type uffd_t; > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]"; > > allow sysadm_t uffd_t:anon_inode { create }; > > > > (The next patch in this series is necessary for making userfaultfd > > support this new interface. The example above is just > > for exposition.) > > > > Signed-off-by: Daniel Colascione > > Signed-off-by: Lokesh Gidra > > --- > > security/selinux/hooks.c| 56 + > > security/selinux/include/classmap.h | 2 ++ > > 2 files changed, 58 insertions(+) > > > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > > index 6b1826fc3658..d092aa512868 100644 > > --- a/security/selinux/hooks.c > > +++ b/security/selinux/hooks.c > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct inode > > *inode, struct inode *dir, > > return 0; > > } > > > > +static int selinux_inode_init_security_anon(struct inode *inode, > > + const struct qstr *name, > > + const struct inode > > *context_inode) > > +{ > > + const struct task_security_struct *tsec = > > selinux_cred(current_cred()); > > + struct common_audit_data ad; > > + struct inode_security_struct *isec; > > + int rc; > > + > > + if (unlikely(!selinux_initialized(_state))) > > + return 0; > > + > > + isec = selinux_inode(inode); > > + > > + /* > > +* We only get here once per ephemeral inode. The inode has > > +* been initialized via inode_alloc_security but is otherwise > > +* untouched. > > +*/ > > + > > + if (context_inode) { > > + struct inode_security_struct *context_isec = > > + selinux_inode(context_inode); > > + if (context_isec->initialized != LABEL_INITIALIZED) > > + return -EACCES; > > + > > + isec->sclass = context_isec->sclass; > > Taking the object class directly from the context_inode is > interesting, and I suspect problematic. In the case below where no > context_inode is supplied the object class is set to > SECCLASS_ANON_INODE, which is correct, but when a context_inode is > supplied there is no guarantee that the object class will be set to > SECCLASS_ANON_INODE. This could both pose a problem for policy > writers (how do you distinguish the anon inode from other normal file > inodes in this case?) as well as an outright fault later in this > function when we try to check the ANON_INODE__CREATE on an object > other than a SECCLASS_ANON_INODE object. > > It works in the userfaultfd case because the context_inode is > originally created with this function so the object class is correctly > set to SECCLASS_ANON_INODE, but can we always guarantee that to be the > case? Do we ever need or want to support using a context_inode that > is not SECCLASS_ANON_INODE? Sorry, I haven't been following this. IIRC, the original reason for passing a context_inode was to support the /dev/kvm or similar use cases where the driver is creating anonymous inodes to represent specific objects/interfaces derived from the device node and we want to be able to control subsequent ioctl operations on those anonymous inodes in the same manner as for the device node. For example, ioctl operations on /dev/kvm can end up returning file descriptors for anonymous inodes representing a specific VM or VCPU or similar. If we propagate the security class and SID from the /dev/kvm inode (the context inode) to the new anonymous inode, we can write a single policy rule over all ioctl operations related to /dev/kvm. That's also why we used the FILE__CREATE permission here originally; that was also intentional. All the file-related classes including anon_inode inherit a common set of file permissions including create and thus we often use the FILE__ in common code when checking permission against any potentially derived class.
[PATCH v14 4/4] userfaultfd: use secure anon inodes for userfaultfd
From: Daniel Colascione This change gives userfaultfd file descriptors a real security context, allowing policy to act on them. Signed-off-by: Daniel Colascione [LG: Remove owner inode from userfaultfd_ctx] [LG: Use anon_inode_getfd_secure() instead of anon_inode_getfile_secure() in userfaultfd syscall] [LG: Use inode of file in userfaultfd_read() in resolve_userfault_fork()] Signed-off-by: Lokesh Gidra Reviewed-by: Eric Biggers --- fs/userfaultfd.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 894cc28142e7..0be8cdd4425a 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -979,14 +979,14 @@ static __poll_t userfaultfd_poll(struct file *file, poll_table *wait) static const struct file_operations userfaultfd_fops; -static int resolve_userfault_fork(struct userfaultfd_ctx *ctx, - struct userfaultfd_ctx *new, +static int resolve_userfault_fork(struct userfaultfd_ctx *new, + struct inode *inode, struct uffd_msg *msg) { int fd; - fd = anon_inode_getfd("[userfaultfd]", _fops, new, - O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS)); + fd = anon_inode_getfd_secure("[userfaultfd]", _fops, new, + O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS), inode); if (fd < 0) return fd; @@ -996,7 +996,7 @@ static int resolve_userfault_fork(struct userfaultfd_ctx *ctx, } static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait, - struct uffd_msg *msg) + struct uffd_msg *msg, struct inode *inode) { ssize_t ret; DECLARE_WAITQUEUE(wait, current); @@ -1107,7 +1107,7 @@ static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait, spin_unlock_irq(>fd_wqh.lock); if (!ret && msg->event == UFFD_EVENT_FORK) { - ret = resolve_userfault_fork(ctx, fork_nctx, msg); + ret = resolve_userfault_fork(fork_nctx, inode, msg); spin_lock_irq(>event_wqh.lock); if (!list_empty(_event)) { /* @@ -1167,6 +1167,7 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf, ssize_t _ret, ret = 0; struct uffd_msg msg; int no_wait = file->f_flags & O_NONBLOCK; + struct inode *inode = file_inode(file); if (ctx->state == UFFD_STATE_WAIT_API) return -EINVAL; @@ -1174,7 +1175,7 @@ static ssize_t userfaultfd_read(struct file *file, char __user *buf, for (;;) { if (count < sizeof(msg)) return ret ? ret : -EINVAL; - _ret = userfaultfd_ctx_read(ctx, no_wait, ); + _ret = userfaultfd_ctx_read(ctx, no_wait, , inode); if (_ret < 0) return ret ? ret : _ret; if (copy_to_user((__u64 __user *) buf, , sizeof(msg))) @@ -1999,8 +2000,8 @@ SYSCALL_DEFINE1(userfaultfd, int, flags) /* prevent the mm struct to be freed */ mmgrab(ctx->mm); - fd = anon_inode_getfd("[userfaultfd]", _fops, ctx, - O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS)); + fd = anon_inode_getfd_secure("[userfaultfd]", _fops, ctx, + O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS), NULL); if (fd < 0) { mmdrop(ctx->mm); kmem_cache_free(userfaultfd_ctx_cachep, ctx); -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH v14 3/4] selinux: teach SELinux about anonymous inodes
From: Daniel Colascione This change uses the anon_inodes and LSM infrastructure introduced in the previous patches to give SELinux the ability to control anonymous-inode files that are created using the new anon_inode_getfd_secure() function. A SELinux policy author detects and controls these anonymous inodes by adding a name-based type_transition rule that assigns a new security type to anonymous-inode files created in some domain. The name used for the name-based transition is the name associated with the anonymous inode for file listings --- e.g., "[userfaultfd]" or "[perf_event]". Example: type uffd_t; type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]"; allow sysadm_t uffd_t:anon_inode { create }; (The next patch in this series is necessary for making userfaultfd support this new interface. The example above is just for exposition.) Signed-off-by: Daniel Colascione Signed-off-by: Lokesh Gidra --- security/selinux/hooks.c| 59 + security/selinux/include/classmap.h | 2 + 2 files changed, 61 insertions(+) diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 644b17ec9e63..8b4e155b2930 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -2934,6 +2933,63 @@ static int selinux_inode_init_security(struct inode *inode, struct inode *dir, return 0; } +static int selinux_inode_init_security_anon(struct inode *inode, + const struct qstr *name, + const struct inode *context_inode) +{ + const struct task_security_struct *tsec = selinux_cred(current_cred()); + struct common_audit_data ad; + struct inode_security_struct *isec; + int rc; + + if (unlikely(!selinux_initialized(_state))) + return 0; + + isec = selinux_inode(inode); + + /* +* We only get here once per ephemeral inode. The inode has +* been initialized via inode_alloc_security but is otherwise +* untouched. +*/ + isec->initialized = LABEL_INITIALIZED; + isec->sclass = SECCLASS_ANON_INODE; + + if (context_inode) { + struct inode_security_struct *context_isec = + selinux_inode(context_inode); + if (context_isec->initialized != LABEL_INITIALIZED) + return -EACCES; + if (context_isec->sclass != SECCLASS_ANON_INODE) { + pr_err("SELinux: initializing anonymous inode with non-anonymous inode"); + return -EACCES; + } + + isec->sid = context_isec->sid; + } else { + rc = security_transition_sid( + _state, tsec->sid, tsec->sid, + isec->sclass, name, >sid); + if (rc) + return rc; + } + + /* +* Now that we've initialized security, check whether we're +* allowed to actually create this type of anonymous inode. +*/ + + ad.type = LSM_AUDIT_DATA_INODE; + ad.u.inode = inode; + + return avc_has_perm(_state, + tsec->sid, + isec->sid, + isec->sclass, + ANON_INODE__CREATE, + ); +} + static int selinux_inode_create(struct inode *dir, struct dentry *dentry, umode_t mode) { return may_create(dir, dentry, SECCLASS_FILE); @@ -7000,6 +7057,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = { LSM_HOOK_INIT(inode_free_security, selinux_inode_free_security), LSM_HOOK_INIT(inode_init_security, selinux_inode_init_security), + LSM_HOOK_INIT(inode_init_security_anon, selinux_inode_init_security_anon), LSM_HOOK_INIT(inode_create, selinux_inode_create), LSM_HOOK_INIT(inode_link, selinux_inode_link), LSM_HOOK_INIT(inode_unlink, selinux_inode_unlink), diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h index 40cebde62856..ba2e01a6955c 100644 --- a/security/selinux/include/classmap.h +++ b/security/selinux/include/classmap.h @@ -249,6 +249,8 @@ struct security_class_mapping secclass_map[] = { {"open", "cpu", "kernel", "tracepoint", "read", "write"} }, { "lockdown", { "integrity", "confidentiality", NULL } }, + { "anon_inode", + { COMMON_FILE_PERMS, NULL } }, { NULL } }; -- 2.30.0.284.gd98b1dd5eaa7-goog
[PATCH v14 0/4] SELinux support for anonymous inodes and UFFD
Userfaultfd in unprivileged contexts could be potentially very useful. We'd like to harden userfaultfd to make such unprivileged use less risky. This patch series allows SELinux to manage userfaultfd file descriptors and in the future, other kinds of anonymous-inode-based file descriptor. SELinux policy authors can apply policy types to anonymous inodes by providing name-based transition rules keyed off the anonymous inode internal name ( "[userfaultfd]" in the case of userfaultfd(2) file descriptors) and applying policy to the new SIDs thus produced. With SELinux managed userfaultfd, an admin can control creation and movement of the file descriptors. In particular, handling of a userfaultfd descriptor by a different process is essentially a ptrace access into the process, without any of the corresponding security_ptrace_access_check() checks. For privacy, the admin may want to deny such accesses, which is possible with SELinux support. Inside the kernel, a new anon_inode interface, anon_inode_getfd_secure, allows callers to opt into this SELinux management. In this new "secure" mode, anon_inodes create new ephemeral inodes for anonymous file objects instead of reusing the normal anon_inodes singleton dummy inode. A new LSM hook gives security modules an opportunity to configure and veto these ephemeral inodes. This patch series is one of two fork of [1] and is an alternative to [2]. The primary difference between the two patch series is that this partch series creates a unique inode for each "secure" anonymous inode, while the other patch series ([2]) continues using the singleton dummy anonymous inode and adds a way to attach SELinux security information directly to file objects. I prefer the approach in this patch series because 1) it's a smaller patch than [2], and 2) it produces a more regular security architecture: in this patch series, secure anonymous inodes aren't S_PRIVATE and they maintain the SELinux property that the label for a file is in its inode. We do need an additional inode per anonymous file, but per-struct-file inode creation doesn't seem to be a problem for pipes and sockets. The previous version of this feature ([1]) created a new SELinux security class for userfaultfd file descriptors. This version adopts the generic transition-based approach of [2]. This patch series also differs from [2] in that it doesn't affect all anonymous inodes right away --- instead requiring anon_inodes callers to opt in --- but this difference isn't one of basic approach. The important question to resolve is whether we should be creating new inodes or enhancing per-file data. Changes from the first version of the patch: - Removed some error checks - Defined a new anon_inode SELinux class to resolve the ambiguity in [3] - Inherit sclass as well as descriptor from context inode Changes from the second version of the patch: - Fixed example policy in the commit message to reflect the use of the new anon_inode class. Changes from the third version of the patch: - Dropped the fops parameter to the LSM hook - Documented hook parameters - Fixed incorrect class used for SELinux transition - Removed stray UFFD changed early in the series - Removed a redundant ERR_PTR(PTR_ERR()) Changes from the fourth version of the patch: - Removed an unused parameter from an internal function - Fixed function documentation Changes from the fifth version of the patch: - Fixed function documentation in fs/anon_inodes.c and include/linux/lsm_hooks.h - Used anon_inode_getfd_secure() in userfaultfd() syscall and removed owner from userfaultfd_ctx. Changes from the sixth version of the patch: - Removed definition of anon_inode_getfile_secure() as there are no callers. - Simplified function description of anon_inode_getfd_secure(). - Elaborated more on the purpose of 'context_inode' in commit message. Changes from the seventh version of the patch: - Fixed error handling in _anon_inode_getfile(). - Fixed minor comment and indentation related issues. Changes from the eighth version of the patch: - Replaced selinux_state.initialized with selinux_state.initialized Changes from the ninth version of the patch: - Fixed function names in fs/anon_inodes.c - Fixed comment of anon_inode_getfd_secure() - Fixed name of the patch wherein userfaultfd code uses anon_inode_getfd_secure() Changes from the tenth version of the patch: - Split first patch into VFS and LSM specific patches - Fixed comments in fs/anon_inodes.c - Fixed comment of alloc_anon_inode() Changes from the eleventh version of the patch: - Removed comment of alloc_anon_inode() for consistency with the code - Fixed explanation of LSM hook in the commit message Changes from the twelfth version of the patch: - Replaced FILE__CREATE with ANON_INODE__CREATE while initializing anon-inode's SELinux security struct. - Check context_inode's SE
Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes
On Thu, Jan 7, 2021 at 2:30 PM Paul Moore wrote: > > On Wed, Jan 6, 2021 at 10:55 PM Lokesh Gidra wrote: > > On Wed, Jan 6, 2021 at 7:03 PM Paul Moore wrote: > > > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra > > > wrote: > > > > From: Daniel Colascione > > > > > > > > This change uses the anon_inodes and LSM infrastructure introduced in > > > > the previous patches to give SELinux the ability to control > > > > anonymous-inode files that are created using the new > > > > anon_inode_getfd_secure() function. > > > > > > > > A SELinux policy author detects and controls these anonymous inodes by > > > > adding a name-based type_transition rule that assigns a new security > > > > type to anonymous-inode files created in some domain. The name used > > > > for the name-based transition is the name associated with the > > > > anonymous inode for file listings --- e.g., "[userfaultfd]" or > > > > "[perf_event]". > > > > > > > > Example: > > > > > > > > type uffd_t; > > > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]"; > > > > allow sysadm_t uffd_t:anon_inode { create }; > > > > > > > > (The next patch in this series is necessary for making userfaultfd > > > > support this new interface. The example above is just > > > > for exposition.) > > > > > > > > Signed-off-by: Daniel Colascione > > > > Signed-off-by: Lokesh Gidra > > > > --- > > > > security/selinux/hooks.c| 56 + > > > > security/selinux/include/classmap.h | 2 ++ > > > > 2 files changed, 58 insertions(+) > > > > > > > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > > > > index 6b1826fc3658..d092aa512868 100644 > > > > --- a/security/selinux/hooks.c > > > > +++ b/security/selinux/hooks.c > > > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct > > > > inode *inode, struct inode *dir, > > > > return 0; > > > > } > > > > > > > > +static int selinux_inode_init_security_anon(struct inode *inode, > > > > + const struct qstr *name, > > > > + const struct inode > > > > *context_inode) > > > > +{ > > > > + const struct task_security_struct *tsec = > > > > selinux_cred(current_cred()); > > > > + struct common_audit_data ad; > > > > + struct inode_security_struct *isec; > > > > + int rc; > > > > + > > > > + if (unlikely(!selinux_initialized(_state))) > > > > + return 0; > > > > + > > > > + isec = selinux_inode(inode); > > > > + > > > > + /* > > > > +* We only get here once per ephemeral inode. The inode has > > > > +* been initialized via inode_alloc_security but is otherwise > > > > +* untouched. > > > > +*/ > > > > + > > > > + if (context_inode) { > > > > + struct inode_security_struct *context_isec = > > > > + selinux_inode(context_inode); > > > > + if (context_isec->initialized != LABEL_INITIALIZED) > > > > + return -EACCES; > > > > + > > > > + isec->sclass = context_isec->sclass; > > > > > > Taking the object class directly from the context_inode is > > > interesting, and I suspect problematic. In the case below where no > > > context_inode is supplied the object class is set to > > > SECCLASS_ANON_INODE, which is correct, but when a context_inode is > > > supplied there is no guarantee that the object class will be set to > > > SECCLASS_ANON_INODE. This could both pose a problem for policy > > > writers (how do you distinguish the anon inode from other normal file > > > inodes in this case?) as well as an outright fault later in this > > > function when we try to check the ANON_INODE__CREATE on an object > > > other than a SECCLASS_ANON_INODE object. > > > > > Thanks for catching this. I'll initialize 'sclass' unconditionally to > > SECCLASS
Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes
On Wed, Jan 6, 2021 at 10:55 PM Lokesh Gidra wrote: > On Wed, Jan 6, 2021 at 7:03 PM Paul Moore wrote: > > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra wrote: > > > From: Daniel Colascione > > > > > > This change uses the anon_inodes and LSM infrastructure introduced in > > > the previous patches to give SELinux the ability to control > > > anonymous-inode files that are created using the new > > > anon_inode_getfd_secure() function. > > > > > > A SELinux policy author detects and controls these anonymous inodes by > > > adding a name-based type_transition rule that assigns a new security > > > type to anonymous-inode files created in some domain. The name used > > > for the name-based transition is the name associated with the > > > anonymous inode for file listings --- e.g., "[userfaultfd]" or > > > "[perf_event]". > > > > > > Example: > > > > > > type uffd_t; > > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]"; > > > allow sysadm_t uffd_t:anon_inode { create }; > > > > > > (The next patch in this series is necessary for making userfaultfd > > > support this new interface. The example above is just > > > for exposition.) > > > > > > Signed-off-by: Daniel Colascione > > > Signed-off-by: Lokesh Gidra > > > --- > > > security/selinux/hooks.c| 56 + > > > security/selinux/include/classmap.h | 2 ++ > > > 2 files changed, 58 insertions(+) > > > > > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > > > index 6b1826fc3658..d092aa512868 100644 > > > --- a/security/selinux/hooks.c > > > +++ b/security/selinux/hooks.c > > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct > > > inode *inode, struct inode *dir, > > > return 0; > > > } > > > > > > +static int selinux_inode_init_security_anon(struct inode *inode, > > > + const struct qstr *name, > > > + const struct inode > > > *context_inode) > > > +{ > > > + const struct task_security_struct *tsec = > > > selinux_cred(current_cred()); > > > + struct common_audit_data ad; > > > + struct inode_security_struct *isec; > > > + int rc; > > > + > > > + if (unlikely(!selinux_initialized(_state))) > > > + return 0; > > > + > > > + isec = selinux_inode(inode); > > > + > > > + /* > > > +* We only get here once per ephemeral inode. The inode has > > > +* been initialized via inode_alloc_security but is otherwise > > > +* untouched. > > > +*/ > > > + > > > + if (context_inode) { > > > + struct inode_security_struct *context_isec = > > > + selinux_inode(context_inode); > > > + if (context_isec->initialized != LABEL_INITIALIZED) > > > + return -EACCES; > > > + > > > + isec->sclass = context_isec->sclass; > > > > Taking the object class directly from the context_inode is > > interesting, and I suspect problematic. In the case below where no > > context_inode is supplied the object class is set to > > SECCLASS_ANON_INODE, which is correct, but when a context_inode is > > supplied there is no guarantee that the object class will be set to > > SECCLASS_ANON_INODE. This could both pose a problem for policy > > writers (how do you distinguish the anon inode from other normal file > > inodes in this case?) as well as an outright fault later in this > > function when we try to check the ANON_INODE__CREATE on an object > > other than a SECCLASS_ANON_INODE object. > > > Thanks for catching this. I'll initialize 'sclass' unconditionally to > SECCLASS_ANON_INODE in the next version. Also, do you think I should > add a check that context_inode's sclass must be SECCLASS_ANON_INODE to > confirm that we never receive a regular inode as context_inode? This is one of the reasons why I was asking if you ever saw the need to use a regular inode here. It seems much safer to me to add a check to ensure that context_inode is SECCLASS_ANON_INODE and return an error otherwise; I would also suggest emitting an error using pr_err() with something along the lines of &quo
[PATCH 4.19 7/7] proc: Use a list of inodes to flush from proc
From: "Eric W. Biederman" [ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ] Rework the flushing of proc to use a list of directory inodes that need to be flushed. The list is kept on struct pid not on struct task_struct, as there is a fixed connection between proc inodes and pids but at least for the case of de_thread the pid of a task_struct changes. This removes the dependency on proc_mnt which allows for different mounts of proc having different mount options even in the same pid namespace and this allows for the removal of proc_mnt which will trivially the first mount of proc to honor it's mount options. This flushing remains an optimization. The functions pid_delete_dentry and pid_revalidate ensure that ordinary dcache management will not attempt to use dentries past the point their respective task has died. When unused the shrinker will eventually be able to remove these dentries. There is a case in de_thread where proc_flush_pid can be called early for a given pid. Which winds up being safe (if suboptimal) as this is just an optiimization. Only pid directories are put on the list as the other per pid files are children of those directories and d_invalidate on the directory will get them as well. So that the pid can be used during flushing it's reference count is taken in release_task and dropped in proc_flush_pid. Further the call of proc_flush_pid is moved after the tasklist_lock is released in release_task so that it is certain that the pid has already been unhashed when flushing it taking place. This removes a small race where a dentry could recreated. As struct pid is supposed to be small and I need a per pid lock I reuse the only lock that currently exists in struct pid the the wait_pidfd.lock. The net result is that this adds all of this functionality with just a little extra list management overhead and a single extra pointer in struct pid. v2: Initialize pid->inodes. I somehow failed to get that initialization into the initial version of the patch. A boot failure was reported by "kernel test robot ", and failure to initialize that pid->inodes matches all of the reported symptoms. Signed-off-by: Eric W. Biederman Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces") Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees") Cc: # 4.19.x: b3e583825266: clone: add CLONE_PIDFD Cc: # 4.19.x: b53b0b9d9a61: pidfd: add polling support Cc: # 4.19.x: 0afa5ca82212: proc: Rename in proc_inode rename sysctl_inodes sibling_inodes Cc: # 4.19.x: 26dbc60f385f: proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache Cc: # 4.19.x: 71448011ea2a: proc: Clear the pieces of proc_inode that proc_evict_inode cares about Cc: # 4.19.x: f90f3cafe8d5: Use d_invalidate in proc_prune_siblings_dcache Cc: # 4.19.x Signed-off-by: Wen Yang --- fs/proc/base.c | 111 fs/proc/inode.c | 2 +- fs/proc/internal.h | 1 + include/linux/pid.h | 1 + include/linux/proc_fs.h | 4 +- kernel/exit.c | 4 +- kernel/pid.c| 1 + 7 files changed, 45 insertions(+), 79 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 5e705fa..ea74c7c 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1741,11 +1741,25 @@ void task_dump_owner(struct task_struct *task, umode_t mode, *rgid = gid; } +void proc_pid_evict_inode(struct proc_inode *ei) +{ + struct pid *pid = ei->pid; + + if (S_ISDIR(ei->vfs_inode.i_mode)) { + spin_lock(>wait_pidfd.lock); + hlist_del_init_rcu(>sibling_inodes); + spin_unlock(>wait_pidfd.lock); + } + + put_pid(pid); +} + struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task, umode_t mode) { struct inode * inode; struct proc_inode *ei; + struct pid *pid; /* We need a new inode */ @@ -1763,10 +1777,18 @@ struct inode *proc_pid_make_inode(struct super_block * sb, /* * grab the reference to task. */ - ei->pid = get_task_pid(task, PIDTYPE_PID); - if (!ei->pid) + pid = get_task_pid(task, PIDTYPE_PID); + if (!pid) goto out_unlock; + /* Let the pid remember us for quick removal */ + ei->pid = pid; + if (S_ISDIR(mode)) { + spin_lock(>wait_pidfd.lock); + hlist_add_head_rcu(>sibling_inodes, >inodes); + spin_unlock(>wait_pidfd.lock); + } + task_dump_owner(task, 0, >i_uid, >i_gid); security_task_to_inode(task, inode); @@ -3067,90 +3089,29 @@ static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *de .permission = proc_pid_permis
[PATCH v2 4.9 10/10] proc: Use a list of inodes to flush from proc
From: "Eric W. Biederman" [ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ] Rework the flushing of proc to use a list of directory inodes that need to be flushed. The list is kept on struct pid not on struct task_struct, as there is a fixed connection between proc inodes and pids but at least for the case of de_thread the pid of a task_struct changes. This removes the dependency on proc_mnt which allows for different mounts of proc having different mount options even in the same pid namespace and this allows for the removal of proc_mnt which will trivially the first mount of proc to honor it's mount options. This flushing remains an optimization. The functions pid_delete_dentry and pid_revalidate ensure that ordinary dcache management will not attempt to use dentries past the point their respective task has died. When unused the shrinker will eventually be able to remove these dentries. There is a case in de_thread where proc_flush_pid can be called early for a given pid. Which winds up being safe (if suboptimal) as this is just an optiimization. Only pid directories are put on the list as the other per pid files are children of those directories and d_invalidate on the directory will get them as well. So that the pid can be used during flushing it's reference count is taken in release_task and dropped in proc_flush_pid. Further the call of proc_flush_pid is moved after the tasklist_lock is released in release_task so that it is certain that the pid has already been unhashed when flushing it taking place. This removes a small race where a dentry could recreated. As struct pid is supposed to be small and I need a per pid lock I reuse the only lock that currently exists in struct pid the the wait_pidfd.lock. The net result is that this adds all of this functionality with just a little extra list management overhead and a single extra pointer in struct pid. v2: Initialize pid->inodes. I somehow failed to get that initialization into the initial version of the patch. A boot failure was reported by "kernel test robot ", and failure to initialize that pid->inodes matches all of the reported symptoms. Signed-off-by: Eric W. Biederman Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces") Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees") Cc: # 4.9.x: b3e583825266: clone: add CLONE_PIDFD Cc: # 4.9.x: b53b0b9d9a61: pidfd: add polling support Cc: # 4.9.x: db978da8fa1d: proc: Pass file mode to proc_pid_make_inode Cc: # 4.9.x: 68eb94f16227: proc: Better ownership of files for non-dumpable tasks in user namespaces Cc: # 4.9.x: e3912ac37e07: proc: use %u for pid printing and slightly less stack Cc: # 4.9.x: 0afa5ca82212: proc: Rename in proc_inode rename sysctl_inodes sibling_inodes Cc: # 4.9.x: 26dbc60f385f: proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache Cc: # 4.9.x: 71448011ea2a: proc: Clear the pieces of proc_inode that proc_evict_inode cares about Cc: # 4.9.x: f90f3cafe8d5: Use d_invalidate in proc_prune_siblings_dcache Cc: # 4.9.x (proc: fix up cherry-pick conflicts for 7bc3e6e55acf) Signed-off-by: Wen Yang --- fs/proc/base.c | 111 fs/proc/inode.c | 2 +- fs/proc/internal.h | 1 + include/linux/pid.h | 1 + include/linux/proc_fs.h | 4 +- kernel/exit.c | 5 ++- kernel/pid.c| 1 + 7 files changed, 45 insertions(+), 80 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 3502a40..11caf35 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1728,11 +1728,25 @@ void task_dump_owner(struct task_struct *task, mode_t mode, *rgid = gid; } +void proc_pid_evict_inode(struct proc_inode *ei) +{ + struct pid *pid = ei->pid; + + if (S_ISDIR(ei->vfs_inode.i_mode)) { + spin_lock(>wait_pidfd.lock); + hlist_del_init_rcu(>sibling_inodes); + spin_unlock(>wait_pidfd.lock); + } + + put_pid(pid); +} + struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task, umode_t mode) { struct inode * inode; struct proc_inode *ei; + struct pid *pid; /* We need a new inode */ @@ -1750,10 +1764,18 @@ struct inode *proc_pid_make_inode(struct super_block * sb, /* * grab the reference to task. */ - ei->pid = get_task_pid(task, PIDTYPE_PID); - if (!ei->pid) + pid = get_task_pid(task, PIDTYPE_PID); + if (!pid) goto out_unlock; + /* Let the pid remember us for quick removal */ + ei->pid = pid; + if (S_ISDIR(mode)) { + spin_lock(>wait_pidfd.lock); + hlist_add_head_rcu(>sibling_inodes, >inodes); + spi
Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes
On Wed, Jan 6, 2021 at 7:03 PM Paul Moore wrote: > > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra wrote: > > From: Daniel Colascione > > > > This change uses the anon_inodes and LSM infrastructure introduced in > > the previous patches to give SELinux the ability to control > > anonymous-inode files that are created using the new > > anon_inode_getfd_secure() function. > > > > A SELinux policy author detects and controls these anonymous inodes by > > adding a name-based type_transition rule that assigns a new security > > type to anonymous-inode files created in some domain. The name used > > for the name-based transition is the name associated with the > > anonymous inode for file listings --- e.g., "[userfaultfd]" or > > "[perf_event]". > > > > Example: > > > > type uffd_t; > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]"; > > allow sysadm_t uffd_t:anon_inode { create }; > > > > (The next patch in this series is necessary for making userfaultfd > > support this new interface. The example above is just > > for exposition.) > > > > Signed-off-by: Daniel Colascione > > Signed-off-by: Lokesh Gidra > > --- > > security/selinux/hooks.c| 56 + > > security/selinux/include/classmap.h | 2 ++ > > 2 files changed, 58 insertions(+) > > > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > > index 6b1826fc3658..d092aa512868 100644 > > --- a/security/selinux/hooks.c > > +++ b/security/selinux/hooks.c > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct inode > > *inode, struct inode *dir, > > return 0; > > } > > > > +static int selinux_inode_init_security_anon(struct inode *inode, > > + const struct qstr *name, > > + const struct inode > > *context_inode) > > +{ > > + const struct task_security_struct *tsec = > > selinux_cred(current_cred()); > > + struct common_audit_data ad; > > + struct inode_security_struct *isec; > > + int rc; > > + > > + if (unlikely(!selinux_initialized(_state))) > > + return 0; > > + > > + isec = selinux_inode(inode); > > + > > + /* > > +* We only get here once per ephemeral inode. The inode has > > +* been initialized via inode_alloc_security but is otherwise > > +* untouched. > > +*/ > > + > > + if (context_inode) { > > + struct inode_security_struct *context_isec = > > + selinux_inode(context_inode); > > + if (context_isec->initialized != LABEL_INITIALIZED) > > + return -EACCES; > > + > > + isec->sclass = context_isec->sclass; > > Taking the object class directly from the context_inode is > interesting, and I suspect problematic. In the case below where no > context_inode is supplied the object class is set to > SECCLASS_ANON_INODE, which is correct, but when a context_inode is > supplied there is no guarantee that the object class will be set to > SECCLASS_ANON_INODE. This could both pose a problem for policy > writers (how do you distinguish the anon inode from other normal file > inodes in this case?) as well as an outright fault later in this > function when we try to check the ANON_INODE__CREATE on an object > other than a SECCLASS_ANON_INODE object. > Thanks for catching this. I'll initialize 'sclass' unconditionally to SECCLASS_ANON_INODE in the next version. Also, do you think I should add a check that context_inode's sclass must be SECCLASS_ANON_INODE to confirm that we never receive a regular inode as context_inode? > It works in the userfaultfd case because the context_inode is > originally created with this function so the object class is correctly > set to SECCLASS_ANON_INODE, but can we always guarantee that to be the > case? Do we ever need or want to support using a context_inode that > is not SECCLASS_ANON_INODE? > I don't think there is any requirement of supporting context_inode which isn't anon-inode. And even if there is, as you described earlier, for ANON_INODE__CREATE to work the sclass has to be SECCLASS_ANON_INODE. I'll appreciate comments on this from others, particularly Daniel and Stephen who originally discussed and implemented this patch. > > + isec->sid = context_isec->sid; > > + } else { > > + isec-&
Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes
On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra wrote: > From: Daniel Colascione > > This change uses the anon_inodes and LSM infrastructure introduced in > the previous patches to give SELinux the ability to control > anonymous-inode files that are created using the new > anon_inode_getfd_secure() function. > > A SELinux policy author detects and controls these anonymous inodes by > adding a name-based type_transition rule that assigns a new security > type to anonymous-inode files created in some domain. The name used > for the name-based transition is the name associated with the > anonymous inode for file listings --- e.g., "[userfaultfd]" or > "[perf_event]". > > Example: > > type uffd_t; > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]"; > allow sysadm_t uffd_t:anon_inode { create }; > > (The next patch in this series is necessary for making userfaultfd > support this new interface. The example above is just > for exposition.) > > Signed-off-by: Daniel Colascione > Signed-off-by: Lokesh Gidra > --- > security/selinux/hooks.c| 56 + > security/selinux/include/classmap.h | 2 ++ > 2 files changed, 58 insertions(+) > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c > index 6b1826fc3658..d092aa512868 100644 > --- a/security/selinux/hooks.c > +++ b/security/selinux/hooks.c > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct inode > *inode, struct inode *dir, > return 0; > } > > +static int selinux_inode_init_security_anon(struct inode *inode, > + const struct qstr *name, > + const struct inode *context_inode) > +{ > + const struct task_security_struct *tsec = > selinux_cred(current_cred()); > + struct common_audit_data ad; > + struct inode_security_struct *isec; > + int rc; > + > + if (unlikely(!selinux_initialized(_state))) > + return 0; > + > + isec = selinux_inode(inode); > + > + /* > +* We only get here once per ephemeral inode. The inode has > +* been initialized via inode_alloc_security but is otherwise > +* untouched. > +*/ > + > + if (context_inode) { > + struct inode_security_struct *context_isec = > + selinux_inode(context_inode); > + if (context_isec->initialized != LABEL_INITIALIZED) > + return -EACCES; > + > + isec->sclass = context_isec->sclass; Taking the object class directly from the context_inode is interesting, and I suspect problematic. In the case below where no context_inode is supplied the object class is set to SECCLASS_ANON_INODE, which is correct, but when a context_inode is supplied there is no guarantee that the object class will be set to SECCLASS_ANON_INODE. This could both pose a problem for policy writers (how do you distinguish the anon inode from other normal file inodes in this case?) as well as an outright fault later in this function when we try to check the ANON_INODE__CREATE on an object other than a SECCLASS_ANON_INODE object. It works in the userfaultfd case because the context_inode is originally created with this function so the object class is correctly set to SECCLASS_ANON_INODE, but can we always guarantee that to be the case? Do we ever need or want to support using a context_inode that is not SECCLASS_ANON_INODE? > + isec->sid = context_isec->sid; > + } else { > + isec->sclass = SECCLASS_ANON_INODE; > + rc = security_transition_sid( > + _state, tsec->sid, tsec->sid, > + isec->sclass, name, >sid); > + if (rc) > + return rc; > + } > + > + isec->initialized = LABEL_INITIALIZED; > + > + /* > +* Now that we've initialized security, check whether we're > +* allowed to actually create this type of anonymous inode. > +*/ > + > + ad.type = LSM_AUDIT_DATA_INODE; > + ad.u.inode = inode; > + > + return avc_has_perm(_state, > + tsec->sid, > + isec->sid, > + isec->sclass, > + ANON_INODE__CREATE, > + ); > +} -- paul moore www.paul-moore.com
[PATCH 4.4 117/132] Btrfs: fix selftests failure due to uninitialized i_mode in test inodes
From: Filipe Manana commit 9f7fec0ba89108b9385f1b9fb167861224912a4a upstream Some of the self tests create a test inode, setup some extents and then do calls to btrfs_get_extent() to test that the corresponding extent maps exist and are correct. However btrfs_get_extent(), since the 5.2 merge window, now errors out when it finds a regular or prealloc extent for an inode that does not correspond to a regular file (its ->i_mode is not S_IFREG). This causes the self tests to fail sometimes, specially when KASAN, slub_debug and page poisoning are enabled: $ modprobe btrfs modprobe: ERROR: could not insert 'btrfs': Invalid argument $ dmesg [ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, integrity-checker=on, ref-verify=on [ 9414.692655] BTRFS: selftest: sectorsize: 4096 nodesize: 4096 [ 9414.692658] BTRFS: selftest: running btrfs free space cache tests [ 9414.692918] BTRFS: selftest: running extent only tests [ 9414.693061] BTRFS: selftest: running bitmap only tests [ 9414.693366] BTRFS: selftest: running bitmap and extent tests [ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent tests [ 9414.697131] BTRFS: selftest: running extent buffer operation tests [ 9414.697133] BTRFS: selftest: running btrfs_split_item tests [ 9414.697564] BTRFS: selftest: running extent I/O tests [ 9414.697583] BTRFS: selftest: running find delalloc tests [ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test [ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests [ 9415.124192] BTRFS: selftest: running inode tests [ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests [ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test [ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent found for non-regular inode 256 [ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a real extent, got 0 This happens because the test inodes are created without ever initializing the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs callback btrfs_alloc_inode() initialize the i_mode. Initialization of the i_mode is done through the various callbacks used by the VFS to create new inodes (regular files, directories, symlinks, tmpfiles, etc), which all call btrfs_new_inode() which in turn calls inode_init_owner(), which sets the inode's i_mode. Since the tests only uses new_inode() to create the test inodes, the i_mode was never initialized. This always happens on a VM I used with kasan, slub_debug and many other debug facilities enabled. It also happened to someone who reported this on bugzilla (on a 5.3-rc). Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode(). Fixes: 6bf9e4bd6a2778 ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397 Signed-off-by: Filipe Manana Reviewed-by: Qu Wenruo Signed-off-by: David Sterba Signed-off-by: Sudip Mukherjee Signed-off-by: Greg Kroah-Hartman --- fs/btrfs/tests/btrfs-tests.c |8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) --- a/fs/btrfs/tests/btrfs-tests.c +++ b/fs/btrfs/tests/btrfs-tests.c @@ -48,7 +48,13 @@ static struct file_system_type test_type struct inode *btrfs_new_test_inode(void) { - return new_inode(test_mnt->mnt_sb); + struct inode *inode; + + inode = new_inode(test_mnt->mnt_sb); + if (inode) + inode_init_owner(inode, NULL, S_IFREG); + + return inode; } int btrfs_init_test_fs(void)
[PATCH 4.9 153/175] Btrfs: fix selftests failure due to uninitialized i_mode in test inodes
From: Filipe Manana commit 9f7fec0ba89108b9385f1b9fb167861224912a4a upstream Some of the self tests create a test inode, setup some extents and then do calls to btrfs_get_extent() to test that the corresponding extent maps exist and are correct. However btrfs_get_extent(), since the 5.2 merge window, now errors out when it finds a regular or prealloc extent for an inode that does not correspond to a regular file (its ->i_mode is not S_IFREG). This causes the self tests to fail sometimes, specially when KASAN, slub_debug and page poisoning are enabled: $ modprobe btrfs modprobe: ERROR: could not insert 'btrfs': Invalid argument $ dmesg [ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, integrity-checker=on, ref-verify=on [ 9414.692655] BTRFS: selftest: sectorsize: 4096 nodesize: 4096 [ 9414.692658] BTRFS: selftest: running btrfs free space cache tests [ 9414.692918] BTRFS: selftest: running extent only tests [ 9414.693061] BTRFS: selftest: running bitmap only tests [ 9414.693366] BTRFS: selftest: running bitmap and extent tests [ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent tests [ 9414.697131] BTRFS: selftest: running extent buffer operation tests [ 9414.697133] BTRFS: selftest: running btrfs_split_item tests [ 9414.697564] BTRFS: selftest: running extent I/O tests [ 9414.697583] BTRFS: selftest: running find delalloc tests [ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test [ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests [ 9415.124192] BTRFS: selftest: running inode tests [ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests [ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test [ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent found for non-regular inode 256 [ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a real extent, got 0 This happens because the test inodes are created without ever initializing the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs callback btrfs_alloc_inode() initialize the i_mode. Initialization of the i_mode is done through the various callbacks used by the VFS to create new inodes (regular files, directories, symlinks, tmpfiles, etc), which all call btrfs_new_inode() which in turn calls inode_init_owner(), which sets the inode's i_mode. Since the tests only uses new_inode() to create the test inodes, the i_mode was never initialized. This always happens on a VM I used with kasan, slub_debug and many other debug facilities enabled. It also happened to someone who reported this on bugzilla (on a 5.3-rc). Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode(). Fixes: 6bf9e4bd6a2778 ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397 Signed-off-by: Filipe Manana Reviewed-by: Qu Wenruo Signed-off-by: David Sterba Signed-off-by: Sudip Mukherjee Signed-off-by: Greg Kroah-Hartman --- fs/btrfs/tests/btrfs-tests.c |8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) --- a/fs/btrfs/tests/btrfs-tests.c +++ b/fs/btrfs/tests/btrfs-tests.c @@ -51,7 +51,13 @@ static struct file_system_type test_type struct inode *btrfs_new_test_inode(void) { - return new_inode(test_mnt->mnt_sb); + struct inode *inode; + + inode = new_inode(test_mnt->mnt_sb); + if (inode) + inode_init_owner(inode, NULL, S_IFREG); + + return inode; } static int btrfs_init_test_fs(void)
[PATCH 4.14 206/242] Btrfs: fix selftests failure due to uninitialized i_mode in test inodes
From: Filipe Manana commit 9f7fec0ba89108b9385f1b9fb167861224912a4a upstream Some of the self tests create a test inode, setup some extents and then do calls to btrfs_get_extent() to test that the corresponding extent maps exist and are correct. However btrfs_get_extent(), since the 5.2 merge window, now errors out when it finds a regular or prealloc extent for an inode that does not correspond to a regular file (its ->i_mode is not S_IFREG). This causes the self tests to fail sometimes, specially when KASAN, slub_debug and page poisoning are enabled: $ modprobe btrfs modprobe: ERROR: could not insert 'btrfs': Invalid argument $ dmesg [ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, integrity-checker=on, ref-verify=on [ 9414.692655] BTRFS: selftest: sectorsize: 4096 nodesize: 4096 [ 9414.692658] BTRFS: selftest: running btrfs free space cache tests [ 9414.692918] BTRFS: selftest: running extent only tests [ 9414.693061] BTRFS: selftest: running bitmap only tests [ 9414.693366] BTRFS: selftest: running bitmap and extent tests [ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent tests [ 9414.697131] BTRFS: selftest: running extent buffer operation tests [ 9414.697133] BTRFS: selftest: running btrfs_split_item tests [ 9414.697564] BTRFS: selftest: running extent I/O tests [ 9414.697583] BTRFS: selftest: running find delalloc tests [ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test [ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests [ 9415.124192] BTRFS: selftest: running inode tests [ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests [ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test [ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent found for non-regular inode 256 [ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a real extent, got 0 This happens because the test inodes are created without ever initializing the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs callback btrfs_alloc_inode() initialize the i_mode. Initialization of the i_mode is done through the various callbacks used by the VFS to create new inodes (regular files, directories, symlinks, tmpfiles, etc), which all call btrfs_new_inode() which in turn calls inode_init_owner(), which sets the inode's i_mode. Since the tests only uses new_inode() to create the test inodes, the i_mode was never initialized. This always happens on a VM I used with kasan, slub_debug and many other debug facilities enabled. It also happened to someone who reported this on bugzilla (on a 5.3-rc). Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode(). Fixes: 6bf9e4bd6a2778 ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397 Signed-off-by: Filipe Manana Reviewed-by: Qu Wenruo Signed-off-by: David Sterba Signed-off-by: Sudip Mukherjee Signed-off-by: Greg Kroah-Hartman --- fs/btrfs/tests/btrfs-tests.c |8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) --- a/fs/btrfs/tests/btrfs-tests.c +++ b/fs/btrfs/tests/btrfs-tests.c @@ -51,7 +51,13 @@ static struct file_system_type test_type struct inode *btrfs_new_test_inode(void) { - return new_inode(test_mnt->mnt_sb); + struct inode *inode; + + inode = new_inode(test_mnt->mnt_sb); + if (inode) + inode_init_owner(inode, NULL, S_IFREG); + + return inode; } static int btrfs_init_test_fs(void)
[PATCH 4.19 298/346] ext4: fix deadlock with fs freezing and EA inodes
From: Jan Kara commit 46e294efc355c48d1dd4d58501aa56dac461792a upstream. Xattr code using inodes with large xattr data can end up dropping last inode reference (and thus deleting the inode) from places like ext4_xattr_set_entry(). That function is called with transaction started and so ext4_evict_inode() can deadlock against fs freezing like: CPU1CPU2 removexattr() freeze_super() vfs_removexattr() ext4_xattr_set() handle = ext4_journal_start() ... ext4_xattr_set_entry() iput(old_ea_inode) ext4_evict_inode(old_ea_inode) sb->s_writers.frozen = SB_FREEZE_FS; sb_wait_write(sb, SB_FREEZE_FS); ext4_freeze() jbd2_journal_lock_updates() -> blocks waiting for all handles to stop sb_start_intwrite() -> blocks as sb is already in SB_FREEZE_FS state Generally it is advisable to delete inodes from a separate transaction as it can consume quite some credits however in this case it would be quite clumsy and furthermore the credits for inode deletion are quite limited and already accounted for. So just tweak ext4_evict_inode() to avoid freeze protection if we have transaction already started and thus it is not really needed anyway. Cc: sta...@vger.kernel.org Fixes: dec214d00e0d ("ext4: xattr inode deduplication") Signed-off-by: Jan Kara Reviewed-by: Andreas Dilger Link: https://lore.kernel.org/r/20201127110649.24730-1-j...@suse.cz Signed-off-by: Theodore Ts'o Signed-off-by: Greg Kroah-Hartman --- fs/ext4/inode.c | 21 +++-- 1 file changed, 15 insertions(+), 6 deletions(-) --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -203,6 +203,7 @@ void ext4_evict_inode(struct inode *inod */ int extra_credits = 6; struct ext4_xattr_inode_array *ea_inode_array = NULL; + bool freeze_protected = false; trace_ext4_evict_inode(inode); @@ -250,9 +251,14 @@ void ext4_evict_inode(struct inode *inod /* * Protect us against freezing - iput() caller didn't have to have any -* protection against it -*/ - sb_start_intwrite(inode->i_sb); +* protection against it. When we are in a running transaction though, +* we are already protected against freezing and we cannot grab further +* protection due to lock ordering constraints. +*/ + if (!ext4_journal_current_handle()) { + sb_start_intwrite(inode->i_sb); + freeze_protected = true; + } if (!IS_NOQUOTA(inode)) extra_credits += EXT4_MAXQUOTAS_DEL_BLOCKS(inode->i_sb); @@ -271,7 +277,8 @@ void ext4_evict_inode(struct inode *inod * cleaned up. */ ext4_orphan_del(NULL, inode); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); goto no_delete; } @@ -312,7 +319,8 @@ void ext4_evict_inode(struct inode *inod stop_handle: ext4_journal_stop(handle); ext4_orphan_del(NULL, inode); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); ext4_xattr_inode_array_free(ea_inode_array); goto no_delete; } @@ -341,7 +349,8 @@ stop_handle: else ext4_free_inode(handle, inode); ext4_journal_stop(handle); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); ext4_xattr_inode_array_free(ea_inode_array); return; no_delete:
[PATCH 5.4 375/453] ext4: fix deadlock with fs freezing and EA inodes
From: Jan Kara commit 46e294efc355c48d1dd4d58501aa56dac461792a upstream. Xattr code using inodes with large xattr data can end up dropping last inode reference (and thus deleting the inode) from places like ext4_xattr_set_entry(). That function is called with transaction started and so ext4_evict_inode() can deadlock against fs freezing like: CPU1CPU2 removexattr() freeze_super() vfs_removexattr() ext4_xattr_set() handle = ext4_journal_start() ... ext4_xattr_set_entry() iput(old_ea_inode) ext4_evict_inode(old_ea_inode) sb->s_writers.frozen = SB_FREEZE_FS; sb_wait_write(sb, SB_FREEZE_FS); ext4_freeze() jbd2_journal_lock_updates() -> blocks waiting for all handles to stop sb_start_intwrite() -> blocks as sb is already in SB_FREEZE_FS state Generally it is advisable to delete inodes from a separate transaction as it can consume quite some credits however in this case it would be quite clumsy and furthermore the credits for inode deletion are quite limited and already accounted for. So just tweak ext4_evict_inode() to avoid freeze protection if we have transaction already started and thus it is not really needed anyway. Cc: sta...@vger.kernel.org Fixes: dec214d00e0d ("ext4: xattr inode deduplication") Signed-off-by: Jan Kara Reviewed-by: Andreas Dilger Link: https://lore.kernel.org/r/20201127110649.24730-1-j...@suse.cz Signed-off-by: Theodore Ts'o Signed-off-by: Greg Kroah-Hartman --- fs/ext4/inode.c | 21 +++-- 1 file changed, 15 insertions(+), 6 deletions(-) --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -203,6 +203,7 @@ void ext4_evict_inode(struct inode *inod */ int extra_credits = 6; struct ext4_xattr_inode_array *ea_inode_array = NULL; + bool freeze_protected = false; trace_ext4_evict_inode(inode); @@ -250,9 +251,14 @@ void ext4_evict_inode(struct inode *inod /* * Protect us against freezing - iput() caller didn't have to have any -* protection against it -*/ - sb_start_intwrite(inode->i_sb); +* protection against it. When we are in a running transaction though, +* we are already protected against freezing and we cannot grab further +* protection due to lock ordering constraints. +*/ + if (!ext4_journal_current_handle()) { + sb_start_intwrite(inode->i_sb); + freeze_protected = true; + } if (!IS_NOQUOTA(inode)) extra_credits += EXT4_MAXQUOTAS_DEL_BLOCKS(inode->i_sb); @@ -271,7 +277,8 @@ void ext4_evict_inode(struct inode *inod * cleaned up. */ ext4_orphan_del(NULL, inode); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); goto no_delete; } @@ -312,7 +319,8 @@ void ext4_evict_inode(struct inode *inod stop_handle: ext4_journal_stop(handle); ext4_orphan_del(NULL, inode); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); ext4_xattr_inode_array_free(ea_inode_array); goto no_delete; } @@ -341,7 +349,8 @@ stop_handle: else ext4_free_inode(handle, inode); ext4_journal_stop(handle); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); ext4_xattr_inode_array_free(ea_inode_array); return; no_delete:
[PATCH 5.10 592/717] ext4: fix deadlock with fs freezing and EA inodes
From: Jan Kara commit 46e294efc355c48d1dd4d58501aa56dac461792a upstream. Xattr code using inodes with large xattr data can end up dropping last inode reference (and thus deleting the inode) from places like ext4_xattr_set_entry(). That function is called with transaction started and so ext4_evict_inode() can deadlock against fs freezing like: CPU1CPU2 removexattr() freeze_super() vfs_removexattr() ext4_xattr_set() handle = ext4_journal_start() ... ext4_xattr_set_entry() iput(old_ea_inode) ext4_evict_inode(old_ea_inode) sb->s_writers.frozen = SB_FREEZE_FS; sb_wait_write(sb, SB_FREEZE_FS); ext4_freeze() jbd2_journal_lock_updates() -> blocks waiting for all handles to stop sb_start_intwrite() -> blocks as sb is already in SB_FREEZE_FS state Generally it is advisable to delete inodes from a separate transaction as it can consume quite some credits however in this case it would be quite clumsy and furthermore the credits for inode deletion are quite limited and already accounted for. So just tweak ext4_evict_inode() to avoid freeze protection if we have transaction already started and thus it is not really needed anyway. Cc: sta...@vger.kernel.org Fixes: dec214d00e0d ("ext4: xattr inode deduplication") Signed-off-by: Jan Kara Reviewed-by: Andreas Dilger Link: https://lore.kernel.org/r/20201127110649.24730-1-j...@suse.cz Signed-off-by: Theodore Ts'o Signed-off-by: Greg Kroah-Hartman --- fs/ext4/inode.c | 21 +++-- 1 file changed, 15 insertions(+), 6 deletions(-) --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -175,6 +175,7 @@ void ext4_evict_inode(struct inode *inod */ int extra_credits = 6; struct ext4_xattr_inode_array *ea_inode_array = NULL; + bool freeze_protected = false; trace_ext4_evict_inode(inode); @@ -232,9 +233,14 @@ void ext4_evict_inode(struct inode *inod /* * Protect us against freezing - iput() caller didn't have to have any -* protection against it -*/ - sb_start_intwrite(inode->i_sb); +* protection against it. When we are in a running transaction though, +* we are already protected against freezing and we cannot grab further +* protection due to lock ordering constraints. +*/ + if (!ext4_journal_current_handle()) { + sb_start_intwrite(inode->i_sb); + freeze_protected = true; + } if (!IS_NOQUOTA(inode)) extra_credits += EXT4_MAXQUOTAS_DEL_BLOCKS(inode->i_sb); @@ -253,7 +259,8 @@ void ext4_evict_inode(struct inode *inod * cleaned up. */ ext4_orphan_del(NULL, inode); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); goto no_delete; } @@ -294,7 +301,8 @@ void ext4_evict_inode(struct inode *inod stop_handle: ext4_journal_stop(handle); ext4_orphan_del(NULL, inode); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); ext4_xattr_inode_array_free(ea_inode_array); goto no_delete; } @@ -323,7 +331,8 @@ stop_handle: else ext4_free_inode(handle, inode); ext4_journal_stop(handle); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); ext4_xattr_inode_array_free(ea_inode_array); return; no_delete:
[PATCH 4.14 209/242] ext4: fix deadlock with fs freezing and EA inodes
From: Jan Kara commit 46e294efc355c48d1dd4d58501aa56dac461792a upstream. Xattr code using inodes with large xattr data can end up dropping last inode reference (and thus deleting the inode) from places like ext4_xattr_set_entry(). That function is called with transaction started and so ext4_evict_inode() can deadlock against fs freezing like: CPU1CPU2 removexattr() freeze_super() vfs_removexattr() ext4_xattr_set() handle = ext4_journal_start() ... ext4_xattr_set_entry() iput(old_ea_inode) ext4_evict_inode(old_ea_inode) sb->s_writers.frozen = SB_FREEZE_FS; sb_wait_write(sb, SB_FREEZE_FS); ext4_freeze() jbd2_journal_lock_updates() -> blocks waiting for all handles to stop sb_start_intwrite() -> blocks as sb is already in SB_FREEZE_FS state Generally it is advisable to delete inodes from a separate transaction as it can consume quite some credits however in this case it would be quite clumsy and furthermore the credits for inode deletion are quite limited and already accounted for. So just tweak ext4_evict_inode() to avoid freeze protection if we have transaction already started and thus it is not really needed anyway. Cc: sta...@vger.kernel.org Fixes: dec214d00e0d ("ext4: xattr inode deduplication") Signed-off-by: Jan Kara Reviewed-by: Andreas Dilger Link: https://lore.kernel.org/r/20201127110649.24730-1-j...@suse.cz Signed-off-by: Theodore Ts'o Signed-off-by: Greg Kroah-Hartman --- fs/ext4/inode.c | 21 +++-- 1 file changed, 15 insertions(+), 6 deletions(-) --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -202,6 +202,7 @@ void ext4_evict_inode(struct inode *inod */ int extra_credits = 6; struct ext4_xattr_inode_array *ea_inode_array = NULL; + bool freeze_protected = false; trace_ext4_evict_inode(inode); @@ -249,9 +250,14 @@ void ext4_evict_inode(struct inode *inod /* * Protect us against freezing - iput() caller didn't have to have any -* protection against it -*/ - sb_start_intwrite(inode->i_sb); +* protection against it. When we are in a running transaction though, +* we are already protected against freezing and we cannot grab further +* protection due to lock ordering constraints. +*/ + if (!ext4_journal_current_handle()) { + sb_start_intwrite(inode->i_sb); + freeze_protected = true; + } if (!IS_NOQUOTA(inode)) extra_credits += EXT4_MAXQUOTAS_DEL_BLOCKS(inode->i_sb); @@ -270,7 +276,8 @@ void ext4_evict_inode(struct inode *inod * cleaned up. */ ext4_orphan_del(NULL, inode); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); goto no_delete; } @@ -311,7 +318,8 @@ void ext4_evict_inode(struct inode *inod stop_handle: ext4_journal_stop(handle); ext4_orphan_del(NULL, inode); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); ext4_xattr_inode_array_free(ea_inode_array); goto no_delete; } @@ -340,7 +348,8 @@ stop_handle: else ext4_free_inode(handle, inode); ext4_journal_stop(handle); - sb_end_intwrite(inode->i_sb); + if (freeze_protected) + sb_end_intwrite(inode->i_sb); ext4_xattr_inode_array_free(ea_inode_array); return; no_delete:
[PATCH AUTOSEL 5.4 101/130] btrfs: fix race that causes unnecessary logging of ancestor inodes
From: Filipe Manana [ Upstream commit 4d6221d7d83141d58ece6560e9cfd4cc92eab044 ] When logging an inode and we are checking if we need to log ancestors that are new, if the previous transaction is still committing we have a time window where we can unnecessarily log ancestor inodes that were created in the previous transaction. The race is described by the following steps: 1) We are at transaction 1000; 2) Directory inode X is created, its generation is set to 1000; 3) The commit for transaction 1000 is started by task A; 4) The task committing transaction 1000 sets the transaction state to unblocked, writes the dirty extent buffers and the super blocks, then unlocks tree_log_mutex; 5) Inode Y, a regular file, is created under directory inode X, this results in starting a new transaction with a generation of 1001; 6) The transaction 1000 commit is unpinning extents. At this point fs_info->last_trans_committed still has a value of 999; 7) Task B calls fsync on inode Y and gets a handle for transaction 1001; 8) Task B ends up at log_all_new_ancestors() and then because inode Y has only one hard link, ends up at log_new_ancestors_fast(). There it reads a value of 999 from fs_info->last_trans_committed, and sees that the parent inode X has a generation of 1000, so we end up logging inode X: if (inode->generation > fs_info->last_trans_committed) { ret = btrfs_log_inode(trans, root, inode, LOG_INODE_EXISTS, ctx); (...) which is not necessary since it was created in the past transaction, with a generation of 1000, and that transaction has already committed its super blocks - it's still unpinning extents so it has not yet updated fs_info->last_trans_committed from 999 to 1000. So this just causes us to spend more time logging and allocating and writing more tree blocks for the log tree. So fix this by comparing an inode's generation with the generation of the transaction our transaction handle refers to - if the inode's generation matches the generation of the current transaction than we know it is a new inode we need to log, otherwise don't log it. This case is often hit when running dbench for a long enough duration. This patch belongs to a patch set that is comprised of the following patches: btrfs: fix race causing unnecessary inode logging during link and rename btrfs: fix race that results in logging old extents during a fast fsync btrfs: fix race that causes unnecessary logging of ancestor inodes btrfs: fix race that makes inode logging fallback to transaction commit btrfs: fix race leading to unnecessary transaction commit when logging inode btrfs: do not block inode logging for so long during transaction commit Performance results are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana Signed-off-by: David Sterba Signed-off-by: Sasha Levin --- fs/btrfs/tree-log.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 12182db88222b..72e0ff38646a7 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5804,7 +5804,6 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans, while (true) { struct btrfs_fs_info *fs_info = root->fs_info; - const u64 last_committed = fs_info->last_trans_committed; struct extent_buffer *leaf = path->nodes[0]; int slot = path->slots[0]; struct btrfs_key search_key; @@ -5820,7 +5819,7 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans, if (IS_ERR(inode)) return PTR_ERR(inode); - if (BTRFS_I(inode)->generation > last_committed) + if (BTRFS_I(inode)->generation >= trans->transid) ret = btrfs_log_inode(trans, root, BTRFS_I(inode), LOG_INODE_EXISTS, 0, LLONG_MAX, ctx); @@ -5862,7 +5861,6 @@ static int log_new_ancestors_fast(struct btrfs_trans_handle *trans, struct btrfs_log_ctx *ctx) { struct btrfs_root *root = inode->root; - struct btrfs_fs_info *fs_info = root->fs_info; struct dentry *old_parent = NULL; struct super_block *sb = inode->vfs_inode.i_sb; int ret = 0; @@ -5876,7 +5874,7 @@ static int log_new_ancestors_fast(struct btrfs_trans_handle *trans, if (root != inode->root) break; - if (inode->generation > fs_info->last_trans_committed) { + if (inode->generation >= trans->transid) { ret = btrfs_log_inode(trans, root, inode, LOG_INODE_EXISTS, 0, LLONG_MAX, ctx); if (ret) -- 2.27.0
Re: [PATCH] ecryptfs: Fix inodes not being evicted until unmount
Steps to reproduce issue: Mount nfs Inside the nfs mount, create back and front directories for ecryptfs Use ecryptfs to mount one directory onto the other Create a few files inside the ecryptfs mount Set ftrace to monitor ecryptfs_do_unlink() and ecryptfs_evict_inode() Delete files Unmount ecryptfs ftrace output before patch: # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) | /* Tracing enabled */ 1) | /* Writing to files */ 1) | /* Unlinking first file */ 0) * 27804.34 us | ecryptfs_do_unlink(); 1) | /* rm'ing second file */ -- 0) unlink-1384 =>rm-1385 -- 0) * 32828.63 us | ecryptfs_do_unlink(); 1) | /* Unmounting eCryptFS */ -- 1) trace-a-1365 => umount-1387 -- 1) 2.408 us| ecryptfs_evict_inode(); 1) 8.914 us| ecryptfs_evict_inode(); 1) 3.344 us| ecryptfs_evict_inode(); ftrace output after patch: # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 0) | /* Tracing enabled */ 0) | /* Writing to files */ 0) | /* Unlinking first file */ 1) * 24728.81 us | ecryptfs_do_unlink(); 1) + 20.923 us | ecryptfs_evict_inode(); 0) | /* rm'ing second file */ -- 1) unlink-1333 =>rm-1334 -- 1) * 41093.71 us | ecryptfs_do_unlink(); 1) + 11.384 us | ecryptfs_evict_inode(); 0) | /* Unmounting eCryptFS */ -- 0) trace-a-1314 => umount-1336 -- 0) 2.986 us| ecryptfs_evict_inode(); On Fri, Dec 18, 2020 at 01:07:30PM -0600, Jeffrey Mitchell wrote: > On asynchronous base filesystems like NFS, eCryptFS leaves inodes for > deleted files in the cache until unmounting. Change call in > ecryptfs_do_unlink() from set_nlink() to drop_nlink() in order to reliably > evict inodes from the cache even on top of NFS. > > Signed-off-by: Dan Robertson > Signed-off-by: Jeffrey Mitchell > --- > fs/ecryptfs/inode.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c > index e23752d..f7594b6 100644 > --- a/fs/ecryptfs/inode.c > +++ b/fs/ecryptfs/inode.c > @@ -147,7 +147,7 @@ static int ecryptfs_do_unlink(struct inode *dir, struct > dentry *dentry, > goto out_unlock; > } > fsstack_copy_attr_times(dir, lower_dir_inode); > - set_nlink(inode, ecryptfs_inode_to_lower(inode)->i_nlink); > + drop_nlink(inode); > inode->i_ctime = dir->i_ctime; > out_unlock: > dput(lower_dentry); > -- > 2.7.4 >
[PATCH] ecryptfs: Fix inodes not being evicted until unmount
On asynchronous base filesystems like NFS, eCryptFS leaves inodes for deleted files in the cache until unmounting. Change call in ecryptfs_do_unlink() from set_nlink() to drop_nlink() in order to reliably evict inodes from the cache even on top of NFS. Signed-off-by: Dan Robertson Signed-off-by: Jeffrey Mitchell --- fs/ecryptfs/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c index e23752d..f7594b6 100644 --- a/fs/ecryptfs/inode.c +++ b/fs/ecryptfs/inode.c @@ -147,7 +147,7 @@ static int ecryptfs_do_unlink(struct inode *dir, struct dentry *dentry, goto out_unlock; } fsstack_copy_attr_times(dir, lower_dir_inode); - set_nlink(inode, ecryptfs_inode_to_lower(inode)->i_nlink); + drop_nlink(inode); inode->i_ctime = dir->i_ctime; out_unlock: dput(lower_dentry); -- 2.7.4
Re: WARNING: filesystem loop0 was created with 512 inodes, the real maximum is 511, mounting anyway
On Mon, Sep 28, 2020 at 11:08 AM Tigran Aivazian wrote: > > On Mon, 28 Sep 2020 at 09:29, Dmitry Vyukov wrote: > > On Mon, Sep 28, 2020 at 10:23 AM Tigran Aivazian > > > No, this is not an issue. In the latest change to BFS I added the > > > following comment to the header fs/bfs/bfs.h, which explains it: > > > > > > /* In theory BFS supports up to 512 inodes, numbered from 2 (for /) up > > > to 513 inclusive. > > >In actual fact, attempting to create the 512th inode (i.e. inode > > > No. 513 or file No. 511) > > >will fail with ENOSPC in bfs_add_entry(): the root directory cannot > > > contain so many entries, counting '..'. > > >So, mkfs.bfs(8) should really limit its -N option to 511 and not > > > 512. For now, we just print a warning > > >if a filesystem is mounted with such "impossible to fill up" number > > > of inodes */ > > > > There are rules for use of "WARNING" in output required to support > > kernel testing: > > https://github.com/torvalds/linux/blob/master/include/asm-generic/bug.h#L67-L80 > > This seems to be triggerable by exteranal inputs and breaks these rules. > > Thank you, I didn't know about these rules. Ok, then, since this > warning does not "need prompt attention if it should ever occur at > runtime", the easiest solution is to change "WARNING" to lower case > "warning" in that printk in fs/bfs/inode.c: > > --- fs/bfs/inode.c.0 2020-09-28 10:03:00.658549556 +0100 > +++ fs/bfs/inode.c 2020-09-28 10:03:05.408548250 +0100 > @@ -351,7 +351,7 @@ > > info->si_lasti = (le32_to_cpu(bfs_sb->s_start) - BFS_BSIZE) / > sizeof(struct bfs_inode) + BFS_ROOT_INO - 1; > if (info->si_lasti == BFS_MAX_LASTI) > - printf("WARNING: filesystem %s was created with 512 inodes, the real > maximum is 511, mounting anyway\n", s->s_id); > + printf("warning: filesystem %s was created with 512 inodes, the real > maximum is 511, mounting anyway\n", s->s_id); > else if (info->si_lasti > BFS_MAX_LASTI) { > printf("Impossible last inode number %lu > %d on %s\n", > info->si_lasti, BFS_MAX_LASTI, s->s_id); > goto out1; > > If you want to submit this patch to the appropriate place(s), feel > free to do this -- I approve it. If the comment in asm/bug.h is > inaccurate and its mention of "BUG/WARNING" implies the lowercase > "bug/warning" also, then one can remove the prefix "warning: " from > the patch altogether and proper case "filesystem" to "Filesystem". > > Kind regards, > Tigran > > Acked-By: Tigran Aivazian > Approved-By: Tigran Aivazian #syz fix: bfs: don't use WARNING: string when it's just info.
Re: WARNING: filesystem loop1 was created with 512 inodes, the real maximum is 511, mounting anyway
#syz fix: bfs: don't use WARNING: string when it's just info. On Mon, Sep 28, 2020 at 8:10 PM syzbot wrote: > > Hello, > > syzbot found the following issue on: > > HEAD commit:c9c9e6a4 Merge tag 'trace-v5.9-rc5-2' of git://git.kernel... > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=138b09d990 > kernel config: https://syzkaller.appspot.com/x/.config?x=5f4c828c9e3cef97 > dashboard link: https://syzkaller.appspot.com/bug?extid=2435de7315366e15f0ca > compiler: gcc (GCC) 10.1.0-syz 20200507 > > Unfortunately, I don't have any reproducer for this issue yet. > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+2435de7315366e15f...@syzkaller.appspotmail.com > > BFS-fs: bfs_fill_super(): WARNING: filesystem loop1 was created with 512 > inodes, the real maximum is 511, mounting anyway > BFS-fs: bfs_fill_super(): Inode 0x0002 corrupted on loop1 > BFS-fs: bfs_fill_super(): WARNING: filesystem loop1 was created with 512 > inodes, the real maximum is 511, mounting anyway > BFS-fs: bfs_fill_super(): Inode 0x0002 corrupted on loop1 > > > --- > This report is generated by a bot. It may contain errors. > See https://goo.gl/tpsmEJ for more information about syzbot. > syzbot engineers can be reached at syzkal...@googlegroups.com. > > syzbot will keep track of this issue. See: > https://goo.gl/tpsmEJ#status for how to communicate with syzbot. > > -- > You received this message because you are subscribed to the Google Groups > "syzkaller-bugs" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to syzkaller-bugs+unsubscr...@googlegroups.com. > To view this discussion on the web visit > https://groups.google.com/d/msgid/syzkaller-bugs/6f1f5d05b06394d0%40google.com.
Re: WARNING: filesystem loop4 was created with 512 inodes, the real maximum is 511, mounting anyway
#syz fix: bfs: don't use WARNING: string when it's just info. On Sat, Nov 21, 2020 at 8:33 AM syzbot wrote: > > Hello, > > syzbot found the following issue on: > > HEAD commit:09162bc3 Linux 5.10-rc4 > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=103f4fbe50 > kernel config: https://syzkaller.appspot.com/x/.config?x=75292221eb79ace2 > dashboard link: https://syzkaller.appspot.com/bug?extid=1a219abc12077a390bc9 > compiler: gcc (GCC) 10.1.0-syz 20200507 > > Unfortunately, I don't have any reproducer for this issue yet. > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+1a219abc12077a390...@syzkaller.appspotmail.com > > BFS-fs: bfs_fill_super(): WARNING: filesystem loop4 was created with 512 > inodes, the real maximum is 511, mounting anyway > BFS-fs: bfs_fill_super(): Last block not available on loop4: 1507328 > BFS-fs: bfs_fill_super(): WARNING: filesystem loop4 was created with 512 > inodes, the real maximum is 511, mounting anyway > BFS-fs: bfs_fill_super(): Last block not available on loop4: 1507328 > > > --- > This report is generated by a bot. It may contain errors. > See https://goo.gl/tpsmEJ for more information about syzbot. > syzbot engineers can be reached at syzkal...@googlegroups.com. > > syzbot will keep track of this issue. See: > https://goo.gl/tpsmEJ#status for how to communicate with syzbot. > > -- > You received this message because you are subscribed to the Google Groups > "syzkaller-bugs" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to syzkaller-bugs+unsubscr...@googlegroups.com. > To view this discussion on the web visit > https://groups.google.com/d/msgid/syzkaller-bugs/bf566005b498f95f%40google.com.
Re: WARNING: filesystem loop2 was created with 512 inodes, the real maximum is 511, mounting anyway
#syz fix: bfs: don't use WARNING: string when it's just info. On Sat, Nov 21, 2020 at 8:33 AM syzbot wrote: > > Hello, > > syzbot found the following issue on: > > HEAD commit:09162bc3 Linux 5.10-rc4 > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=16e9a48650 > kernel config: https://syzkaller.appspot.com/x/.config?x=e93bbe4ce29223b > dashboard link: https://syzkaller.appspot.com/bug?extid=ae3ff0bb2a0133596a5b > compiler: clang version 11.0.0 > (https://github.com/llvm/llvm-project.git > ca2dcbd030eadbf0aa9b660efe864ff08af6e18b) > > Unfortunately, I don't have any reproducer for this issue yet. > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+ae3ff0bb2a0133596...@syzkaller.appspotmail.com > > BFS-fs: bfs_fill_super(): WARNING: filesystem loop2 was created with 512 > inodes, the real maximum is 511, mounting anyway > BFS-fs: bfs_fill_super(): Last block not available on loop2: 1507328 > BFS-fs: bfs_fill_super(): WARNING: filesystem loop2 was created with 512 > inodes, the real maximum is 511, mounting anyway > BFS-fs: bfs_fill_super(): Last block not available on loop2: 1507328 > > > --- > This report is generated by a bot. It may contain errors. > See https://goo.gl/tpsmEJ for more information about syzbot. > syzbot engineers can be reached at syzkal...@googlegroups.com. > > syzbot will keep track of this issue. See: > https://goo.gl/tpsmEJ#status for how to communicate with syzbot. > > -- > You received this message because you are subscribed to the Google Groups > "syzkaller-bugs" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to syzkaller-bugs+unsubscr...@googlegroups.com. > To view this discussion on the web visit > https://groups.google.com/d/msgid/syzkaller-bugs/c2f72c05b498f9bd%40google.com.
Re: WARNING: filesystem loop3 was created with 512 inodes, the real maximum is 511, mounting anyway
#syz fix: bfs: don't use WARNING: string when it's just info. On Thu, Sep 24, 2020 at 11:40 AM syzbot wrote: > > Hello, > > syzbot found the following issue on: > > HEAD commit:98477740 Merge branch 'rcu/urgent' of git://git.kernel.org.. > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=15964ec390 > kernel config: https://syzkaller.appspot.com/x/.config?x=6f192552d75898a1 > dashboard link: https://syzkaller.appspot.com/bug?extid=293714df4fe354fae488 > compiler: gcc (GCC) 10.1.0-syz 20200507 > > Unfortunately, I don't have any reproducer for this issue yet. > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+293714df4fe354fae...@syzkaller.appspotmail.com > > BFS-fs: bfs_fill_super(): WARNING: filesystem loop3 was created with 512 > inodes, the real maximum is 511, mounting anyway > BFS-fs: bfs_fill_super(): Inode 0x0002 corrupted on loop3 > > > --- > This report is generated by a bot. It may contain errors. > See https://goo.gl/tpsmEJ for more information about syzbot. > syzbot engineers can be reached at syzkal...@googlegroups.com. > > syzbot will keep track of this issue. See: > https://goo.gl/tpsmEJ#status for how to communicate with syzbot. > > -- > You received this message because you are subscribed to the Google Groups > "syzkaller-bugs" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to syzkaller-bugs+unsubscr...@googlegroups.com. > To view this discussion on the web visit > https://groups.google.com/d/msgid/syzkaller-bugs/28201305b00bfdba%40google.com.
Re: WARNING: filesystem loop0 was created with 512 inodes, the real maximum is 511, mounting anywa
#syz fix: bfs: don't use WARNING: string when it's just info. On Mon, Dec 7, 2020 at 1:53 PM syzbot wrote: > > Hello, > > syzbot found the following issue on: > > HEAD commit:34816d20 Merge tag 'gfs2-v5.10-rc5-fixes' of git://git.ker.. > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=157dad0750 > kernel config: https://syzkaller.appspot.com/x/.config?x=b3a044ccf5b03ac4 > dashboard link: https://syzkaller.appspot.com/bug?extid=02c44c7f92e70a73730a > compiler: gcc (GCC) 10.1.0-syz 20200507 > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=152b05ab50 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=14fc3fad50 > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+02c44c7f92e70a737...@syzkaller.appspotmail.com > > BFS-fs: bfs_fill_super(): WARNING: filesystem loop0 was created with 512 > inodes, the real maximum is 511, mounting anywa > > > --- > This report is generated by a bot. It may contain errors. > See https://goo.gl/tpsmEJ for more information about syzbot. > syzbot engineers can be reached at syzkal...@googlegroups.com. > > syzbot will keep track of this issue. See: > https://goo.gl/tpsmEJ#status for how to communicate with syzbot. > syzbot can test patches for this issue, for details see: > https://goo.gl/tpsmEJ#testing-patches > > -- > You received this message because you are subscribed to the Google Groups > "syzkaller-bugs" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to syzkaller-bugs+unsubscr...@googlegroups.com. > To view this discussion on the web visit > https://groups.google.com/d/msgid/syzkaller-bugs/fbb4f505b5df4eea%40google.com.
WARNING: filesystem loop0 was created with 512 inodes, the real maximum is 511, mounting anywa
Hello, syzbot found the following issue on: HEAD commit:34816d20 Merge tag 'gfs2-v5.10-rc5-fixes' of git://git.ker.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=157dad0750 kernel config: https://syzkaller.appspot.com/x/.config?x=b3a044ccf5b03ac4 dashboard link: https://syzkaller.appspot.com/bug?extid=02c44c7f92e70a73730a compiler: gcc (GCC) 10.1.0-syz 20200507 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=152b05ab50 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=14fc3fad50 IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+02c44c7f92e70a737...@syzkaller.appspotmail.com BFS-fs: bfs_fill_super(): WARNING: filesystem loop0 was created with 512 inodes, the real maximum is 511, mounting anywa --- This report is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkal...@googlegroups.com. syzbot will keep track of this issue. See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot. syzbot can test patches for this issue, for details see: https://goo.gl/tpsmEJ#testing-patches
[PATCH 10/10] proc: Use a list of inodes to flush from proc
From: "Eric W. Biederman" [ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ] Rework the flushing of proc to use a list of directory inodes that need to be flushed. The list is kept on struct pid not on struct task_struct, as there is a fixed connection between proc inodes and pids but at least for the case of de_thread the pid of a task_struct changes. This removes the dependency on proc_mnt which allows for different mounts of proc having different mount options even in the same pid namespace and this allows for the removal of proc_mnt which will trivially the first mount of proc to honor it's mount options. This flushing remains an optimization. The functions pid_delete_dentry and pid_revalidate ensure that ordinary dcache management will not attempt to use dentries past the point their respective task has died. When unused the shrinker will eventually be able to remove these dentries. There is a case in de_thread where proc_flush_pid can be called early for a given pid. Which winds up being safe (if suboptimal) as this is just an optiimization. Only pid directories are put on the list as the other per pid files are children of those directories and d_invalidate on the directory will get them as well. So that the pid can be used during flushing it's reference count is taken in release_task and dropped in proc_flush_pid. Further the call of proc_flush_pid is moved after the tasklist_lock is released in release_task so that it is certain that the pid has already been unhashed when flushing it taking place. This removes a small race where a dentry could recreated. As struct pid is supposed to be small and I need a per pid lock I reuse the only lock that currently exists in struct pid the the wait_pidfd.lock. The net result is that this adds all of this functionality with just a little extra list management overhead and a single extra pointer in struct pid. v2: Initialize pid->inodes. I somehow failed to get that initialization into the initial version of the patch. A boot failure was reported by "kernel test robot ", and failure to initialize that pid->inodes matches all of the reported symptoms. Signed-off-by: Eric W. Biederman Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces") Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees") Cc: # 4.9.x: b3e5838: clone: add CLONE_PIDFD Cc: # 4.9.x: b53b0b9: pidfd: add polling support Cc: # 4.9.x: db978da: proc: Pass file mode to proc_pid_make_inode Cc: # 4.9.x: 68eb94f: proc: Better ownership of files for non-dumpable tasks in user namespaces Cc: # 4.9.x: e3912ac: proc: use %u for pid printing and slightly less stack Cc: # 4.9.x: 0afa5ca: proc: Rename in proc_inode rename sysctl_inodes sibling_inodes Cc: # 4.9.x: 26dbc60: proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache Cc: # 4.9.x: 7144801: proc: Clear the pieces of proc_inode that proc_evict_inode cares about Cc: # 4.9.x: f90f3ca: Use d_invalidate in proc_prune_siblings_dcache Cc: # 4.9.x (proc: fix up cherry-pick conflicts for 7bc3e6e55acf) Signed-off-by: Wen Yang --- fs/proc/base.c | 111 fs/proc/inode.c | 2 +- fs/proc/internal.h | 1 + include/linux/pid.h | 1 + include/linux/proc_fs.h | 4 +- kernel/exit.c | 5 ++- kernel/pid.c| 1 + 7 files changed, 45 insertions(+), 80 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 3502a40..11caf35 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1728,11 +1728,25 @@ void task_dump_owner(struct task_struct *task, mode_t mode, *rgid = gid; } +void proc_pid_evict_inode(struct proc_inode *ei) +{ + struct pid *pid = ei->pid; + + if (S_ISDIR(ei->vfs_inode.i_mode)) { + spin_lock(>wait_pidfd.lock); + hlist_del_init_rcu(>sibling_inodes); + spin_unlock(>wait_pidfd.lock); + } + + put_pid(pid); +} + struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task, umode_t mode) { struct inode * inode; struct proc_inode *ei; + struct pid *pid; /* We need a new inode */ @@ -1750,10 +1764,18 @@ struct inode *proc_pid_make_inode(struct super_block * sb, /* * grab the reference to task. */ - ei->pid = get_task_pid(task, PIDTYPE_PID); - if (!ei->pid) + pid = get_task_pid(task, PIDTYPE_PID); + if (!pid) goto out_unlock; + /* Let the pid remember us for quick removal */ + ei->pid = pid; + if (S_ISDIR(mode)) { + spin_lock(>wait_pidfd.lock); + hlist_add_head_rcu(>sibling_inodes, >inodes); + spin_unlock(>wait_pidfd.lock); + }
Re: WARNING: filesystem loop5 was created with 512 inodes, the real maximum is 511, mounting anyway
On Thu, Dec 3, 2020 at 1:55 PM Dmitry Vyukov wrote: > > On Thu, Dec 3, 2020 at 5:15 AM Randy Dunlap wrote: > > > > On 12/1/20 1:17 PM, Randy Dunlap wrote: > > > On 11/30/20 11:47 PM, Dmitry Vyukov wrote: > > >> On Tue, Dec 1, 2020 at 2:03 AM Randy Dunlap > > >> wrote: > > >>> > > >>> On 11/30/20 12:43 AM, Dmitry Vyukov wrote: > > >>>> On Mon, Nov 30, 2020 at 5:29 AM Randy Dunlap > > >>>> wrote: > > >>>>> > > >>>>> On 11/27/20 4:32 AM, syzbot wrote: > > >>>>>> Hello, > > >>>>>> > > >>>>>> syzbot found the following issue on: > > >>>>>> > > >>>>>> HEAD commit:418baf2c Linux 5.10-rc5 > > >>>>>> git tree: upstream > > >>>>>> console output: > > >>>>>> https://syzkaller.appspot.com/x/log.txt?x=171555b950 > > >>>>>> kernel config: > > >>>>>> https://syzkaller.appspot.com/x/.config?x=b81aff78c272da44 > > >>>>>> dashboard link: > > >>>>>> https://syzkaller.appspot.com/bug?extid=3fd34060f26e766536ff > > >>>>>> compiler: gcc (GCC) 10.1.0-syz 20200507 > > >>>>>> > > >>>>>> Unfortunately, I don't have any reproducer for this issue yet. > > >>>>>> > > >>>>>> IMPORTANT: if you fix the issue, please add the following tag to the > > >>>>>> commit: > > >>>>>> Reported-by: syzbot+3fd34060f26e76653...@syzkaller.appspotmail.com > > >>>>>> > > >>>>>> BFS-fs: bfs_fill_super(): loop5 is unclean, continuing > > >>>>>> BFS-fs: bfs_fill_super(): WARNING: filesystem loop5 was created with > > >>>>>> 512 inodes, the real maximum is 511, mounting anyway > > >>>>>> BFS-fs: bfs_fill_super(): Last block not available on loop5: 120 > > >>>>>> BFS-fs: bfs_fill_super(): loop5 is unclean, continuing > > >>>>>> BFS-fs: bfs_fill_super(): WARNING: filesystem loop5 was created with > > >>>>>> 512 inodes, the real maximum is 511, mounting anyway > > >>>>>> BFS-fs: bfs_fill_super(): Last block not available on loop5: 120 > > >>>>>> > > >>>>>> > > >>>>>> --- > > >>>>>> This report is generated by a bot. It may contain errors. > > >>>>>> See https://goo.gl/tpsmEJ for more information about syzbot. > > >>>>>> syzbot engineers can be reached at syzkal...@googlegroups.com. > > >>>>>> > > >>>>>> syzbot will keep track of this issue. See: > > >>>>>> https://goo.gl/tpsmEJ#status for how to communicate with syzbot. > > > > ... > > > > >>>> Hi Randy, > > >>>> > > >>>> I see this bug was reported with a reproducer: > > >>>> https://syzkaller.appspot.com/bug?id=a32ebd5db2f7c957b82cf54b97bdecf367bf0421 > > >>>> I assume it's a dup of this one. > > >>> > > >>> Sure, looks the same. > > >>> > > >>>> If you need the image itself, you can dump it to a file in the C > > >>>> reproducer inside of syz_mount_image before mount call. > > >>> > > >>> Yes, got that. > > >>> > > >>> What outcome or result are you looking for here? > > >>> Or what do you see as the problem? > > >> > > >> Hi Randy, > > >> > > >> "WARNING:" in kernel output is supposed to mean a kernel source bug. > > >> Presence of that kernel bug is what syzbot has reported. > > >> > > >> Note: the bug may be a misuse of the "WARNING:" for invalid user > > >> inputs in output as well :) > > > > > > > > > [adding Al Viro] > > > > > > Hi Dmitry, > > > > > > I expect that the "WARNING:" message is being interpreted incorrectly > > > here, > > > but that's a minor issue IMO. > > > > > > if (info->si_lasti == BFS_MAX_LASTI) > > > printf("WARNING: filesystem %s was created with 512 inodes, > > > the real maximum is 511, mounting anyway\n", s->s_id); > > > > > > > ... > > > > > > > > > > > However, in testing this, I see that the BFS image is not mounted > > > on /dev/loop# at all. > > > > > > 'mount' says: > > > > > > # mount -t bfs -o loop bfsfilesyz000.img /mnt/stand > > > mount: /mnt/stand: mount(2) system call failed: Not a directory. > > > > > > (but it is a directory) > > > > > > and I have tracked that down to fs/namespace.c::graft_tree() > > > returning -ENOTDIR, but I don't know why that is happening. > > > > > > > > > Al, can you provide any insights on this? > > > > OK, with Al's help, here is the situation. > > > > If I use a regular file instead of a directory, the mount > > command succeeds. > > > > The printk() from fs/bfs/inode.c that uses the WARNING: string > > is not a WARN() or WARN_ON(). It's just a printk(). > > > > says: > > > > * Do not include "BUG"/"WARNING" in format strings manually to make these > > * conditions distinguishable from kernel issues. > > > > so if I change fs/bfs/inode.c to use "warning:" or "Warning," or "Note:", > > this little problem should go away. Is that correct? > > Hi, > > Yes, any of these prefixes will work (not be considered as a kernel > issue). syzkaller only matches "WARNING:" verbatim. I don't know about > all other kernel testing systems, but at least it's distinguishable. > > Maybe also worth adding "bfs:" prefix for cases when people stare at > dmesg afterwards. Oh, sorry, there are already enough prefixes (BFS-fs: bfs_fill_super():).