subject:"Inodes"

[PATCH AUTOSEL 6.7 19/39] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-28 Thread Sasha Levin

(const char *name, 
umode_t mode,
   struct dentry *parent, void *data,
   const struct file_operations *fops)
 {
+   struct tracefs_inode *ti;
struct dentry *dentry;
struct inode *inode;
 
@@ -621,7 +635,11 @@ struct dentry *tracefs_create_file(const char *name, 
umode_t mode,
if (unlikely(!inode))
return tracefs_failed_creating(dentry);
 
+   ti = get_tracefs(inode);
+   ti->private = instance_inode(parent, inode);
+
inode->i_mode = mode;
+   inode->i_op = _file_inode_operations;
inode->i_fop = fops ? fops : _file_operations;
inode->i_private = data;
inode->i_uid = d_inode(dentry->d_parent)->i_uid;
@@ -634,6 +652,7 @@ struct dentry *tracefs_create_file(const char *name, 
umode_t mode,
 static struct dentry *__create_dir(const char *name, struct dentry *parent,
   const struct inode_operations *ops)
 {
+   struct tracefs_inode *ti;
struct dentry *dentry = tracefs_start_creating(name, parent);
struct inode *inode;
 
@@ -651,6 +670,9 @@ static struct dentry *__create_dir(const char *name, struct 
dentry *parent,
inode->i_uid = d_inode(dentry->d_parent)->i_uid;
inode->i_gid = d_inode(dentry->d_parent)->i_gid;
 
+   ti = get_tracefs(inode);
+   ti->private = instance_inode(parent, inode);
+
/* directory inodes start off with i_nlink == 2 (for "." entry) */
inc_nlink(inode);
d_instantiate(dentry, inode);
@@ -681,7 +703,7 @@ struct dentry *tracefs_create_dir(const char *name, struct 
dentry *parent)
if (security_locked_down(LOCKDOWN_TRACEFS))
return NULL;
 
-   return __create_dir(name, parent, _dir_inode_operations);
+   return __create_dir(name, parent, _dir_inode_operations);
 }
 
 /**
@@ -712,7 +734,7 @@ __init struct dentry *tracefs_create_instance_dir(const 
char *name,
if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir))
return NULL;
 
-   dentry = __create_dir(name, parent, _dir_inode_operations);
+   dentry = __create_dir(name, parent, 
_instance_dir_inode_operations);
if (!dentry)
return NULL;
 
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 42bdeb471a07..12b7d0150ae9 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -5,6 +5,9 @@
 enum {
TRACEFS_EVENT_INODE = BIT(1),
TRACEFS_EVENT_TOP_INODE = BIT(2),
+   TRACEFS_GID_PERM_SET= BIT(3),
+   TRACEFS_UID_PERM_SET= BIT(4),
+   TRACEFS_INSTANCE_INODE  = BIT(5),
 };
 
 struct tracefs_inode {
-- 
2.43.0

Re: [RESEND] [PATCH] eventfs: Have inodes have unique inode numbers

2024-01-26 Thread Steven Rostedt

On Fri, 26 Jan 2024 15:24:17 -0500
Mathieu Desnoyers  wrote:

> On 2024-01-26 15:12, Steven Rostedt wrote:
> [...]
> > diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
> > index e1b172c0e091..2187be6d7b23 100644
> > --- a/fs/tracefs/inode.c
> > +++ b/fs/tracefs/inode.c
> > @@ -223,13 +223,41 @@ static const struct inode_operations 
> > tracefs_file_inode_operations = {
> > .setattr= tracefs_setattr,
> >   };
> >   
> > +/* Copied from get_next_ino() but adds allocation for multiple inodes */
> > +#define LAST_INO_BATCH 1024
> > +#define LAST_INO_MASK (~(LAST_INO_BATCH - 1))
> > +static DEFINE_PER_CPU(unsigned int, last_ino);
> > +
> > +unsigned int tracefs_get_next_ino(int files)
> > +{
> > +   unsigned int *p = _cpu_var(last_ino);
> > +   unsigned int res = *p;
> > +
> > +#ifdef CONFIG_SMP
> > +   /* Check if adding files+1 overflows */  
> 
> How does it handle a @files input where:
> 
> * (files+1 > LAST_INO_BATCH) ?
> 
> * (files+1 == LAST_INO_BATCH) ?

Well, this is moot anyway, as Linus hates it.

-- Steve

Re: [RESEND] [PATCH] eventfs: Have inodes have unique inode numbers

2024-01-26 Thread Mathieu Desnoyers


On 2024-01-26 15:12, Steven Rostedt wrote:
[...]

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..2187be6d7b23 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -223,13 +223,41 @@ static const struct inode_operations 
tracefs_file_inode_operations = {
.setattr= tracefs_setattr,
  };
  
+/* Copied from get_next_ino() but adds allocation for multiple inodes */

+#define LAST_INO_BATCH 1024
+#define LAST_INO_MASK (~(LAST_INO_BATCH - 1))
+static DEFINE_PER_CPU(unsigned int, last_ino);
+
+unsigned int tracefs_get_next_ino(int files)
+{
+   unsigned int *p = _cpu_var(last_ino);
+   unsigned int res = *p;
+
+#ifdef CONFIG_SMP
+   /* Check if adding files+1 overflows */


How does it handle a @files input where:

* (files+1 > LAST_INO_BATCH) ?

* (files+1 == LAST_INO_BATCH) ?


+   if (unlikely(!res || (res & LAST_INO_MASK) != ((res + files + 1) & 
LAST_INO_MASK))) {
+   static atomic_t shared_last_ino;
+   int next = atomic_add_return(LAST_INO_BATCH, _last_ino);
+
+   res = next - LAST_INO_BATCH;
+   }
+#endif
+
+   res++;
+   /* get_next_ino should not provide a 0 inode number */
+   if (unlikely(!res))
+   res++;


I suspect that bumping this res++ in the 0 case can cause inode range
reservation issues at (files+1 == LAST_INO_BATCH-1).

Thanks,

Mathieu


+   *p = res + files;
+   put_cpu_var(last_ino);
+   return res;
+}


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

[RESEND] [PATCH] eventfs: Have inodes have unique inode numbers

2024-01-26 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

Linus suggested to use the same inode numbers to make it easier to
implement getdents(), as it was creating inodes just for generating a
unique and consistent inode number. Linus suggested to just use the same
inode for all files and directories.

Later it was discovered that having directories with the same inode number
would mess up the "find" command, but Linus found that on 64 bit machines,
there was a hole in the eventfs_inode structure due to alignment that
could be used to store the inode numbers for directories. That fixed the
directory issue, but the files still had their own inode number.

The 'tar' command uses inode numbers for determining uniqueness between
files, which this would break. Currently, tar is broken with tracefs
because all files show a stat of zero size and tar doesn't copy anything.
But because tar cares about inode numbers, there could be other
applications that do too. It's best to have all files have unique inode
numbers.

Copy the get_next_ino() to tracefs_get_next_ino() that takes a "files"
parameter. As eventfs directories have a fixed number of files within
them, the number of inodes needed for the eventfs directory files is known
when the directory is created. The tracefs_get_next_ino() will return a
new inode number but also reserve the next "files" inode numbers that the
caller is free to use. Then when an inode for a file is created, its inode
number will be its parent directory's inode number plus the index into the
file array of that directory, giving each file a unique inode number that
can be retrieved at any time.

Signed-off-by: Steven Rostedt (Google) 
---
[
  Resending because I sent the first one to the wrong mailing list.
]

 fs/tracefs/event_inode.c | 31 +++
 fs/tracefs/inode.c   | 37 ++---
 fs/tracefs/internal.h|  1 +
 3 files changed, 50 insertions(+), 19 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 6b211522a13e..7be7a694b106 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -32,14 +32,11 @@
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
-/* Choose something "unique" ;-) */
-#define EVENTFS_FILE_INODE_INO 0x12c4e37
-
 /* Just try to make something consistent and unique */
-static int eventfs_dir_ino(struct eventfs_inode *ei)
+static int eventfs_dir_ino(struct eventfs_inode *ei, int nr_files)
 {
if (!ei->ino)
-   ei->ino = get_next_ino();
+   ei->ino = tracefs_get_next_ino(nr_files);
 
return ei->ino;
 }
@@ -327,6 +324,7 @@ void eventfs_update_gid(struct dentry *dentry, kgid_t gid)
  * @parent: parent dentry for this file.
  * @data: something that the caller will want to get to later on.
  * @fop: struct file_operations that should be used for this file.
+ * @ino: inode number for this file
  *
  * This function creates a dentry that represents a file in the eventsfs_inode
  * directory. The inode.i_private pointer will point to @data in the open()
@@ -335,7 +333,8 @@ void eventfs_update_gid(struct dentry *dentry, kgid_t gid)
 static struct dentry *create_file(const char *name, umode_t mode,
  struct eventfs_attr *attr,
  struct dentry *parent, void *data,
- const struct file_operations *fop)
+ const struct file_operations *fop,
+ unsigned int ino)
 {
struct tracefs_inode *ti;
struct dentry *dentry;
@@ -363,9 +362,7 @@ static struct dentry *create_file(const char *name, umode_t 
mode,
inode->i_op = _file_inode_operations;
inode->i_fop = fop;
inode->i_private = data;
-
-   /* All files will have the same inode number */
-   inode->i_ino = EVENTFS_FILE_INODE_INO;
+   inode->i_ino = ino;
 
ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
@@ -377,12 +374,14 @@ static struct dentry *create_file(const char *name, 
umode_t mode,
 /**
  * create_dir - create a dir in the tracefs filesystem
  * @ei: the eventfs_inode that represents the directory to create
- * @parent: parent dentry for this file.
+ * @parent: parent dentry for this directory.
+ * @nr_files: The number of files (not directories) this directory has
  *
  * This function will create a dentry for a directory represented by
  * a eventfs_inode.
  */
-static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry 
*parent)
+static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry 
*parent,
+int nr_files)
 {
struct tracefs_inode *ti;
struct dentry *dentry;
@@ -404,7 +403,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, 
struct dentry *parent
inode->i_fop = _file_operations;

Re: [PATCH v3 1/2] eventfs: Have the inodes all for files and directories all be the same

2024-01-22 Thread Darrick J. Wong

On Mon, Jan 22, 2024 at 02:02:28PM -0800, Linus Torvalds wrote:
> On Mon, 22 Jan 2024 at 13:59, Darrick J. Wong  wrote:
> >
> >  though I don't think
> > leaking raw kernel pointers is an awesome idea.
> 
> Yeah, I wasn't all that comfortable even with trying to hash it
> (because I think the number of source bits is small enough that even
> with a crypto hash, it's trivially brute-forceable).
> 
> See
> 
>https://lore.kernel.org/all/20240122152748.46897...@gandalf.local.home/
> 
> for the current patch under discussion (and it contains a link _to_
> said discussion).

Ah, cool, thank you!

--D

>Linus

Re: [PATCH v3 1/2] eventfs: Have the inodes all for files and directories all be the same

2024-01-22 Thread Linus Torvalds

On Mon, 22 Jan 2024 at 13:59, Darrick J. Wong  wrote:
>
>  though I don't think
> leaking raw kernel pointers is an awesome idea.

Yeah, I wasn't all that comfortable even with trying to hash it
(because I think the number of source bits is small enough that even
with a crypto hash, it's trivially brute-forceable).

See

   https://lore.kernel.org/all/20240122152748.46897...@gandalf.local.home/

for the current patch under discussion (and it contains a link _to_
said discussion).

   Linus

Re: [PATCH v3 1/2] eventfs: Have the inodes all for files and directories all be the same

2024-01-22 Thread Darrick J. Wong

On Tue, Jan 16, 2024 at 05:55:32PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> The dentries and inodes are created in the readdir for the sole purpose of
> getting a consistent inode number. Linus stated that is unnecessary, and
> that all inodes can have the same inode number. For a virtual file system
> they are pretty meaningless.
> 
> Instead use a single unique inode number for all files and one for all
> directories.
> 
> Link: https://lore.kernel.org/all/20240116133753.2808d...@gandalf.local.home/
> Link: 
> https://lore.kernel.org/linux-trace-kernel/20240116211353.412180...@goodmis.org
> 
> Cc: Masami Hiramatsu 
> Cc: Mark Rutland 
> Cc: Mathieu Desnoyers 
> Cc: Christian Brauner 
> Cc: Al  Viro 
> Cc: Ajay Kaher 
> Suggested-by: Linus Torvalds 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  fs/tracefs/event_inode.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index fdff53d5a1f8..5edf0b96758b 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -32,6 +32,10 @@
>   */
>  static DEFINE_MUTEX(eventfs_mutex);
>  
> +/* Choose something "unique" ;-) */
> +#define EVENTFS_FILE_INODE_INO   0x12c4e37
> +#define EVENTFS_DIR_INODE_INO0x134b2f5
> +
>  /*
>   * The eventfs_inode (ei) itself is protected by SRCU. It is released from
>   * its parent's list and will have is_freed set (under eventfs_mutex).
> @@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, 
> umode_t mode,
>   inode->i_fop = fop;
>   inode->i_private = data;
>  
> + /* All files will have the same inode number */
> + inode->i_ino = EVENTFS_FILE_INODE_INO;
> +
>   ti = get_tracefs(inode);
>   ti->flags |= TRACEFS_EVENT_INODE;
>   d_instantiate(dentry, inode);
> @@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode 
> *ei, struct dentry *parent
>   inode->i_op = _root_dir_inode_operations;
>   inode->i_fop = _file_operations;
>  
> + /* All directories will have the same inode number */
> + inode->i_ino = EVENTFS_DIR_INODE_INO;

Regrettably, this leads to find failing on 6.8-rc1 (see xfs/55[89] in
fstests):

# find /sys/kernel/debug/tracing/ >/dev/null
find: File system loop detected; 
‘/sys/kernel/debug/tracing/events/initcall/initcall_finish’ is part of the same 
file system loop as ‘/sys/kernel/debug/tracing/events/initcall’.
find: File system loop detected; 
‘/sys/kernel/debug/tracing/events/initcall/initcall_start’ is part of the same 
file system loop as ‘/sys/kernel/debug/tracing/events/initcall’.
find: File system loop detected; 
‘/sys/kernel/debug/tracing/events/initcall/initcall_level’ is part of the same 
file system loop as ‘/sys/kernel/debug/tracing/events/initcall’.

There were no such reports on 6.7.0; AFAICT find(1) is tripping over
parent and child subdirectory having the same dev/i_ino.  Changing this
line to the following:

/* All directories will NOT have the same inode number */
inode->i_ino = (unsigned long)inode;

makes the messages about filesystem loops go away, though I don't think
leaking raw kernel pointers is an awesome idea.

--D

> +
>   ti = get_tracefs(inode);
>   ti->flags |= TRACEFS_EVENT_INODE;
>  
> -- 
> 2.43.0
> 
> 
>

Re: [PATCH] eventfs: Save directory inodes in the eventfs_inode structure

2024-01-22 Thread Kees Cook

On Mon, Jan 22, 2024 at 03:27:48PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> The eventfs inodes and directories are allocated when referenced. But this
> leaves the issue of keeping consistent inode numbers and the number is
> only saved in the inode structure itself. When the inode is no longer
> referenced, it can be freed. When the file that the inode was representing
> is referenced again, the inode is once again created, but the inode number
> needs to be the same as it was before.
> 
> Just making the inode numbers the same for all files is fine, but that
> does not work with directories. The find command will check for loops via
> the inode number and having the same inode number for directories triggers:
> 
>   # find /sys/kernel/tracing
> find: File system loop detected;
> '/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the 
> same file system loop as
> '/sys/kernel/debug/tracing/events/initcall'.
> [..]
> 
> Linus pointed out that the eventfs_inode structure ends with a single
> 32bit int, and on 64 bit machines, there's likely a 4 byte hole due to
> alignment. We can use this hole to store the inode number for the
> eventfs_inode. All directories in eventfs are represented by an
> eventfs_inode and that data structure can hold its inode number.
> 
> That last int was also purposely placed at the end of the structure to
> prevent holes from within. Now that there's a 4 byte number to hold the
> inode, both the inode number and the last integer can be moved up in the
> structure for better cache locality, where the llist and rcu fields can be
> moved to the end as they are only used when the eventfs_inode is being
> deleted.
> 
> Link: 
> https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Ynrfb8=x+c7x0eewxn-yr...@mail.gmail.com/
> 
> Reported-by: Geert Uytterhoeven 
> Fixes: 53c41052ba31 ("eventfs: Have the inodes all for files and directories 
> all be the same")
> Signed-off-by: Steven Rostedt (Google) 

Since I reviewed the earlier patch, I will repeat here for the formal
one too. :) Thanks for avoiding the hashing!

Reviewed-by: Kees Cook 

-- 
Kees Cook

[PATCH] eventfs: Save directory inodes in the eventfs_inode structure

2024-01-22 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

The eventfs inodes and directories are allocated when referenced. But this
leaves the issue of keeping consistent inode numbers and the number is
only saved in the inode structure itself. When the inode is no longer
referenced, it can be freed. When the file that the inode was representing
is referenced again, the inode is once again created, but the inode number
needs to be the same as it was before.

Just making the inode numbers the same for all files is fine, but that
does not work with directories. The find command will check for loops via
the inode number and having the same inode number for directories triggers:

  # find /sys/kernel/tracing
find: File system loop detected;
'/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the same 
file system loop as
'/sys/kernel/debug/tracing/events/initcall'.
[..]

Linus pointed out that the eventfs_inode structure ends with a single
32bit int, and on 64 bit machines, there's likely a 4 byte hole due to
alignment. We can use this hole to store the inode number for the
eventfs_inode. All directories in eventfs are represented by an
eventfs_inode and that data structure can hold its inode number.

That last int was also purposely placed at the end of the structure to
prevent holes from within. Now that there's a 4 byte number to hold the
inode, both the inode number and the last integer can be moved up in the
structure for better cache locality, where the llist and rcu fields can be
moved to the end as they are only used when the eventfs_inode is being
deleted.

Link: 
https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Ynrfb8=x+c7x0eewxn-yr...@mail.gmail.com/

Reported-by: Geert Uytterhoeven 
Fixes: 53c41052ba31 ("eventfs: Have the inodes all for files and directories 
all be the same")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 14 +++---
 fs/tracefs/internal.h|  7 ---
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 6795fda2af19..6b211522a13e 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -34,7 +34,15 @@ static DEFINE_MUTEX(eventfs_mutex);
 
 /* Choose something "unique" ;-) */
 #define EVENTFS_FILE_INODE_INO 0x12c4e37
-#define EVENTFS_DIR_INODE_INO  0x134b2f5
+
+/* Just try to make something consistent and unique */
+static int eventfs_dir_ino(struct eventfs_inode *ei)
+{
+   if (!ei->ino)
+   ei->ino = get_next_ino();
+
+   return ei->ino;
+}
 
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
@@ -396,7 +404,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, 
struct dentry *parent
inode->i_fop = _file_operations;
 
/* All directories will have the same inode number */
-   inode->i_ino = EVENTFS_DIR_INODE_INO;
+   inode->i_ino = eventfs_dir_ino(ei);
 
ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
@@ -802,7 +810,7 @@ static int eventfs_iterate(struct file *file, struct 
dir_context *ctx)
 
name = ei_child->name;
 
-   ino = EVENTFS_DIR_INODE_INO;
+   ino = eventfs_dir_ino(ei_child);
 
if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
goto out_dec;
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 12b7d0150ae9..45397df9bb65 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -55,6 +55,10 @@ struct eventfs_inode {
struct eventfs_attr *entry_attrs;
struct eventfs_attr attr;
void*data;
+   unsigned intis_freed:1;
+   unsigned intis_events:1;
+   unsigned intnr_entries:30;
+   unsigned intino;
/*
 * Union - used for deletion
 * @llist:  for calling dput() if needed after RCU
@@ -64,9 +68,6 @@ struct eventfs_inode {
struct llist_node   llist;
struct rcu_head rcu;
};
-   unsigned intis_freed:1;
-   unsigned intis_events:1;
-   unsigned intnr_entries:30;
 };
 
 static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
-- 
2.43.0

[PATCH v3 2/2] eventfs: Do not create dentries nor inodes in iterate_shared

2024-01-16 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

The original eventfs code added a wrapper around the dcache_readdir open
callback and created all the dentries and inodes at open, and increment
their ref count. A wrapper was added around the dcache_readdir release
function to decrement all the ref counts of those created inodes and
dentries. But this proved to be buggy[1] for when a kprobe was created
during a dir read, it would create a dentry between the open and the
release, and because the release would decrement all ref counts of all
files and directories, that would include the kprobe directory that was
not there to have its ref count incremented in open. This would cause the
ref count to go to negative and later crash the kernel.

To solve this, the dentries and inodes that were created and had their ref
count upped in open needed to be saved. That list needed to be passed from
the open to the release, so that the release would only decrement the ref
counts of the entries that were incremented in the open.

Unfortunately, the dcache_readdir logic was already using the
file->private_data, which is the only field that can be used to pass
information from the open to the release. What was done was the eventfs
created another descriptor that had a void pointer to save the
dcache_readdir pointer, and it wrapped all the callbacks, so that it could
save the list of entries that had their ref counts incremented in the
open, and pass it to the release. The wrapped callbacks would just put
back the dcache_readdir pointer and call the functions it used so it could
still use its data[2].

But Linus had an issue with the "hijacking" of the file->private_data
(unfortunately this discussion was on a security list, so no public link).
Which we finally agreed on doing everything within the iterate_shared
callback and leave the dcache_readdir out of it[3]. All the information
needed for the getents() could be created then.

But this ended up being buggy too[4]. The iterate_shared callback was not
the right place to create the dentries and inodes. Even Christian Brauner
had issues with that[5].

An attempt was to go back to creating the inodes and dentries at
the open, create an array to store the information in the
file->private_data, and pass that information to the other callbacks.[6]

The difference between that and the original method, is that it does not
use dcache_readdir. It also does not up the ref counts of the dentries and
pass them. Instead, it creates an array of a structure that saves the
dentry's name and inode number. That information is used in the
iterate_shared callback, and the array is freed in the dir release. The
dentries and inodes created in the open are not used for the iterate_share
or release callbacks. Just their names and inode numbers.

Linus did not like that either[7] and just wanted to remove the dentries
being created in iterate_shared and use the hard coded inode numbers.

[ All this while Linus enjoyed an unexpected vacation during the merge
  window due to lack of power. ]

[1] 
https://lore.kernel.org/linux-trace-kernel/20230919211804.230ed...@gandalf.local.home/
[2] 
https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d...@gandalf.local.home/
[3] 
https://lore.kernel.org/linux-trace-kernel/20240104015435.682218...@goodmis.org/
[4] https://lore.kernel.org/all/202401152142.bfc28861-oliver.s...@intel.com/
[5] https://lore.kernel.org/all/20240111-unzahl-gefegt-433acb8a841d@brauner/
[6] https://lore.kernel.org/all/20240116114711.7e863...@gandalf.local.home/
[7] https://lore.kernel.org/all/20240116170154.5bf0a...@gandalf.local.home/

Link: 
https://lore.kernel.org/linux-trace-kernel/20240116211353.573784...@goodmis.org

Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Linus Torvalds 
Cc: Christian Brauner 
Cc: Al  Viro 
Cc: Ajay Kaher 
Fixes: 493ec81a8fb8 ("eventfs: Stop using dcache_readdir() for getdents()")
Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-lkp/202401152142.bfc28861-oliver.s...@intel.com
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 20 +---
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 5edf0b96758b..10580d6b5012 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -727,8 +727,6 @@ static int eventfs_iterate(struct file *file, struct 
dir_context *ctx)
struct eventfs_inode *ei_child;
struct tracefs_inode *ti;
struct eventfs_inode *ei;
-   struct dentry *ei_dentry = NULL;
-   struct dentry *dentry;
const char *name;
umode_t mode;
int idx;
@@ -749,11 +747,11 @@ static int eventfs_iterate(struct file *file, struct 
dir_context *ctx)
 
mutex_lock(_mutex);
ei = READ_ONCE(ti->private);
-   if (ei && !ei->is_freed)
-   ei_dentry = READ_ONCE(ei->dentry);
+

[PATCH v3 0/2] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt



[ subject is still wrong, but is to match v2, see patch 2 for correct subject ]


Changes since v2: 
https://lore.kernel.org/all/20240116211217.968123...@goodmis.org/

Implemented Linus's suggestion to just change the iterate_shared to
use the hard coded inodes.


Steven Rostedt (Google) (2):
  eventfs: Have the inodes all for files and directories all be the same
  eventfs: Do not create dentries nor inodes in iterate_shared


 fs/tracefs/event_inode.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

[PATCH v3 1/2] eventfs: Have the inodes all for files and directories all be the same

2024-01-16 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

The dentries and inodes are created in the readdir for the sole purpose of
getting a consistent inode number. Linus stated that is unnecessary, and
that all inodes can have the same inode number. For a virtual file system
they are pretty meaningless.

Instead use a single unique inode number for all files and one for all
directories.

Link: https://lore.kernel.org/all/20240116133753.2808d...@gandalf.local.home/
Link: 
https://lore.kernel.org/linux-trace-kernel/20240116211353.412180...@goodmis.org

Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Christian Brauner 
Cc: Al  Viro 
Cc: Ajay Kaher 
Suggested-by: Linus Torvalds 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..5edf0b96758b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -32,6 +32,10 @@
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
+/* Choose something "unique" ;-) */
+#define EVENTFS_FILE_INODE_INO 0x12c4e37
+#define EVENTFS_DIR_INODE_INO  0x134b2f5
+
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
  * its parent's list and will have is_freed set (under eventfs_mutex).
@@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t 
mode,
inode->i_fop = fop;
inode->i_private = data;
 
+   /* All files will have the same inode number */
+   inode->i_ino = EVENTFS_FILE_INODE_INO;
+
ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
d_instantiate(dentry, inode);
@@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, 
struct dentry *parent
inode->i_op = _root_dir_inode_operations;
inode->i_fop = _file_operations;
 
+   /* All directories will have the same inode number */
+   inode->i_ino = EVENTFS_DIR_INODE_INO;
+
ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
 
-- 
2.43.0

[PATCH v2 0/2] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt

[ subject is wrong, but is to match v1, see patch 2 for correct subject ]

Fix reading dir again, this time without creating dentries and inodes.

Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20240116114711.7e863...@gandalf.local.home


Steven Rostedt (Google) (2):
  eventfs: Have the inodes all for files and directories all be the same
  eventfs: Create list of files and directories at dir open


 fs/tracefs/event_inode.c | 223 +--
 1 file changed, 159 insertions(+), 64 deletions(-)

[PATCH v2 1/2] eventfs: Have the inodes all for files and directories all be the same

2024-01-16 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

The dentries and inodes are created in the readdir for the sole purpose of
getting a consistent inode number. Linus stated that is unnecessary, and
that all inodes can have the same inode number. For a virtual file system
they are pretty meaningless.

Instead use a single unique inode number for all files and one for all
directories.

Link: https://lore.kernel.org/all/20240116133753.2808d...@gandalf.local.home/

Suggested-by: Linus Torvalds 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..5edf0b96758b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -32,6 +32,10 @@
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
+/* Choose something "unique" ;-) */
+#define EVENTFS_FILE_INODE_INO 0x12c4e37
+#define EVENTFS_DIR_INODE_INO  0x134b2f5
+
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
  * its parent's list and will have is_freed set (under eventfs_mutex).
@@ -352,6 +356,9 @@ static struct dentry *create_file(const char *name, umode_t 
mode,
inode->i_fop = fop;
inode->i_private = data;
 
+   /* All files will have the same inode number */
+   inode->i_ino = EVENTFS_FILE_INODE_INO;
+
ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
d_instantiate(dentry, inode);
@@ -388,6 +395,9 @@ static struct dentry *create_dir(struct eventfs_inode *ei, 
struct dentry *parent
inode->i_op = _root_dir_inode_operations;
inode->i_fop = _file_operations;
 
+   /* All directories will have the same inode number */
+   inode->i_ino = EVENTFS_DIR_INODE_INO;
+
ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
 
-- 
2.43.0

Re: [PATCH] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt

On Tue, 16 Jan 2024 10:53:36 -0800
Linus Torvalds  wrote:

> Let's at least *try* to move towards a better and simpler world, in other
> words.

OK, but I did just finish the below. I'll save it for another time if the
single inode becomes an issue.

I cropped it to just 31 bits, so I'm not sure how much that would still
expose kernel addresses.

But as you prefer a single inode number, I'll submit that instead, and
store this as one of my many git branches that I hoard.

This patch would give me:

~# ls -i /sys/kernel/tracing/events
2024739183 9p   801480813 iocost281269778 qdisc
 401505785 alarmtimer   641794539 iomap1206040657 ras
1930290274 avc 1239754069 iommu 275816503 raw_syscalls
1108006611 block138722947 io_uring 1758073266 rcu
1528091656 bpf_test_run 694421747 ipi  1384703124 regmap
 557364529 bpf_trace  2765888 irq30507018 regulator
1351362737 cfg80211 369905250 irq_matrix   2078862821 resctrl
 886445085 cgroup  1115008967 irq_vectors   324090967 rpm
 796209754 clk 1448778784 jbd2 1141318882 rseq
 478739129 compaction99410253 kmem 1274783780 rtc
1714397712 cpuhp   2001779594 ksm  1409072807 sched
 720701943 csd   51728677 kyber 347239934 scsi
1824588347 dev 1507974437 libata   1768671172 sd
1640041299 devfreq 1421146927 lock 1167562598 signal
 121399632 dma_fence956825721 mac802114116590 skb
 975908383 drm  738868061 maple_tree   1435501164 smbus
1227060804 e1000e_trace 969175760 mce  1664441095 sock
1770307058 enable  1225375766 mdio 1634697993 spi
1372107864 error_report 744198394 migrate   556841757 swiotlb
1356665351 exceptions   602669807 mmap  400337480 syscalls
[..]

~# ls -i /sys/kernel/tracing/events/sched
1906992193 enable1210369853 sched_process_wait
 592505447 filter1443020725 sched_stat_blocked
1742488280 sched_kthread_stop1213185672 sched_stat_iowait
1674576234 sched_kthread_stop_ret1908510325 sched_stat_runtime
1999376743 sched_kthread_work_execute_end1570203600 sched_stat_sleep
1041533096 sched_kthread_work_execute_start   608113040 sched_stat_wait
 166824308 sched_kthread_work_queue_work 2033295833 sched_stick_numa
 492462616 sched_migrate_task1617500395 sched_swap_numa
1786868196 sched_move_numa   1865216679 sched_switch
 691687388 sched_pi_setprio  1465729401 sched_wait_task
1610977146 sched_process_exec 969941227 
sched_wake_idle_without_ipi
 610128037 sched_process_exit  84398030 sched_wakeup
2139284699 sched_process_fork 750220489 sched_wakeup_new
 169172804 sched_process_free1024064201 sched_waking

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a0598eb3e26e..57448bf4acb4 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -32,6 +33,9 @@
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
+/* Used for making inode numbers */
+static siphash_key_t inode_key;
+
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
  * its parent's list and will have is_freed set (under eventfs_mutex).
@@ -57,6 +61,28 @@ static int eventfs_dir_open(struct inode *inode, struct file 
*file);
 static int eventfs_iterate(struct file *file, struct dir_context *ctx);
 static int eventfs_dir_release(struct inode *inode, struct file *file);
 
+/* Copied from scripts/kconfig/symbol.c */
+static unsigned strhash(const char *s)
+{
+   /* fnv32 hash */
+   unsigned hash = 2166136261U;
+   for (; *s; s++)
+   hash = (hash ^ *s) * 0x01000193;
+   return hash;
+}
+
+/* Just try to make something consistent and unique */
+static int eventfs_inode_ino(struct eventfs_inode *ei, const char *name)
+{
+   unsigned long sip = (unsigned long)ei;
+
+   sip += strhash(name);
+   sip = siphash_1u32((int)sip, _key);
+
+   /* keep it positive */
+   return sip & ((1U << 31) - 1);
+}
+
 static void update_attr(struct eventfs_attr *attr, struct iattr *iattr)
 {
unsigned int ia_valid = iattr->ia_valid;
@@ -491,6 +517,8 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
mutex_unlock(_mutex);
 
dentry = create_file(name, mode, attr, parent, data, fops);
+   if (!IS_ERR_OR_NULL(dentry))
+   dentry->d_inode->i_ino = eventfs_inode_ino(ei, name);
 
mutex_lock(_mutex);
 
@@ -585,6 +613,8 @@ create_dir_dentry(struct eventfs_inode *pei, struct 
eventfs_inode *ei,
mutex_unlock(_mutex);
 
dentry = create_dir(ei, parent);
+   if (!IS_ERR_OR_NULL(dentry))
+

Re: [PATCH] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt

On Tue, 16 Jan 2024 10:21:49 -0800
Linus Torvalds  wrote:

> Here's a clue: just fix your inode numbers.
> 
> I can think of many ways to do it. Here's a couple:
> 
>  - use a fixed inode number for all inodes. It's fine. Really. You might
> confuse some programs that still do getpwd() the legacy way, but hey,
> nobody cares
> 
>  - just put the inode number in the same data structure everything else is
>
> 
>  - make the inode number be a hash of the address of your data structure.
> That's actually the closest to a traditional "real" inode number, which is
> just an index to some on-disk thing
> 
> I'm sure there are other *trivial* solutions.
> 
> None of this is an excuse to misuse sentries.
> 
> Try the first one - one single inode number - first. You shouldn't be doing
> iget() anyway, so why do you care so deeply about a number that makes no
> sense and nobody should care about?

It was me being paranoid that using the same inode number would break user
space. If that is not a concern, then I'm happy to just make it either the
same, or maybe just hash the ei and name that it is associated with.

If I do not fully understand how something is used, I try hard to make it
act the same as it does for other use cases. That is, I did all this to
keep inodes unique and consistent because I did not know if it would break
something if I didn't.

Removing that requirement does make it much easier to implement readdir.

I think I'll do the hashing, just because I'm still paranoid that something
might still break if they are all the same.

-- Steve

Re: [PATCH] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt

On Tue, 16 Jan 2024 13:12:28 -0500
Steven Rostedt  wrote:

> Maybe I can just use a hash to generate he inode numbers from the name?
> Hopefully there will be no collisions. Then I don't need the dentry
> creation at all.

Maybe I could use a hash of the address of the meta data to create the
inode number? That may be unique enough.

-- Steve

Re: [PATCH] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt

On Tue, 16 Jan 2024 09:55:15 -0800
Linus Torvalds  wrote:

> [ html crud because I still don't have power or real Internet, just trying
> to keep an eye on things on my phone. Mailing lists removed to avoid
> bounces, please put them back in replies that don't have my horrible
> formatting ]
> 
> No.
> 
> Christ, you're making things worse again
> 
> The reason for the bug is that you're still messing with the dentries at
> readdir() time.

I may have deleted the comment, but the only reason I created the
inodes/destries is to keep the consistent inode number as it's created at
the time the inodes and dentries are.

Ah I did delete the comment, but it is still applicable :-/

   /*
-* Need to create the dentries and inodes to have a consistent
-* inode number.
+* Need to make a struct eventfs_dent array, start by
+* allocating enough for just the files, which is a fixed
+* array. Then use realloc via add_entry() for the directories
+* which is stored in a linked list.
 */

So if for some reason, user space did a getdents() and used the inode
numbers to match what it found, they would likely be the same.

> 
> Just stop it. Readdir should not be creating dentries. Readdir should not
> be even *looking* at dentries. You're not a virtual filesystem, the
> dentries are just caches for filename lookup, and have *nothing* to do with
> readdir.

Actually, it's not looking at them. I did it as a way to just have the
inode numbers be consistent.

dents[cnt].ino = d->d_inode->i_ino;

Yes, that's the only reason I create them. The dentry/inode is not used for
anything outside of that.

> 
> So just iterate over your own internal data structures in readdir. DO NOT
> CREATE DENTRIES OR INODES FOR READDIR.
> 
> I've told you before, and I'll tell you again: either you are a real and
> proper virtual filesystem and you let the vfs layer manage *everything*,
> and the dentries and inodes are all you have. Or you are a *real*
> filesystem and you maintain your own data structures and the dentries and
> inodes are just the in-memory caches.
> 
> This "do both" is UNACCEPTABLE.
> 

The dentries were created for the inode numbers so that I did not need to
add them to meta data. They are generated at creation time.

I don't know how important inode numbers are. If they can all just have
random inode numbers and it doesn't break user space, where an inode number
will be one value at one read and another shortly after, is that going to
cause a problem?

Maybe I can just use a hash to generate he inode numbers from the name?
Hopefully there will be no collisions. Then I don't need the dentry
creation at all.

I do realize that if the dentries get freed due to reclaim and recreated,
their inode numbers will be different. But for a virtual file system, I
don't know how important having consistent inode numbers is.

-- Steve

[PATCH] eventfs: Create dentries and inodes at dir open

2024-01-16 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

The original eventfs code added a wrapper around the dcache_readdir open
callback and created all the dentries and inodes at open, and increment
their ref count. A wrapper was added around the dcache_readdir release
function to decrement all the ref counts of those created inodes and
dentries. But this proved to be buggy[1] for when a kprobe was created
during a dir read, it would create a dentry between the open and the
release, and because the release would decrement all ref counts of all
files and directories, that would include the kprobe directory that was
not there to have its ref count incremented in open. This would cause the
ref count to go to negative and later crash the kernel.

To solve this, the dentries and inodes that were created and had their ref
count upped in open needed to be saved. That list needed to be passed from
the open to the release, so that the release would only decrement the ref
counts of the entries that were incremented in the open.

Unfortunately, the dcache_readdir logic was already using the
file->private_data, which is the only field that can be used to pass
information from the open to the release. What was done was the eventfs
created another descriptor that had a void pointer to save the
dcache_readdir pointer, and it wrapped all the callbacks, so that it could
save the list of entries that had their ref counts incremented in the
open, and pass it to the release. The wrapped callbacks would just put
back the dcache_readdir pointer and call the functions it used so it could
still use its data[2].

But Linus had an issue with the "hijacking" of the file->private_data
(unfortunately this discussion was on a security list, so no public link).
Which we finally agreed on doing everything within the iterate_shared
callback and leave the dcache_readdir out of it[3]. All the information
needed for the getents() could be created then.

But this ended up being buggy too[4]. The iterate_shared callback was not
the right place to create the dentries and inodes. Even Christian Brauner
had issues with that[5].

The real fix should be to go back to creating the inodes and dentries at
the open, create an array to store the information in the
file->private_data, and pass that information to the other callbacks.

The difference between this and the original method, is that it does not
use dcache_readdir. It also does not up the ref counts of the dentries and
pass them. Instead, it creates an array of a structure that saves the
dentry's name and inode number. That information is used in the
iterate_shared callback, and the array is freed in the dir release. The
dentries and inodes created in the open are not used for the iterate_share
or release callbacks. Just their names and inode numbers.

This means that the state of the eventfs at the dir open remains the same
from the point of view of the getdents() function, until the dir is closed.
This also means that the getdents() will not fail. If there's an issue, it
fails at the dir open.

[1] 
https://lore.kernel.org/linux-trace-kernel/20230919211804.230ed...@gandalf.local.home/
[2] 
https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d...@gandalf.local.home/
[3] 
https://lore.kernel.org/linux-trace-kernel/20240104015435.682218...@goodmis.org/
[4] https://lore.kernel.org/all/202401152142.bfc28861-oliver.s...@intel.com/
[5] https://lore.kernel.org/all/20240111-unzahl-gefegt-433acb8a841d@brauner/

Fixes: 493ec81a8fb8 ("eventfs: Stop using dcache_readdir() for getdents()")
Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-lkp/202401152142.bfc28861-oliver.s...@intel.com
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 209 ---
 1 file changed, 153 insertions(+), 56 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..50b37f31d835 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -53,7 +53,9 @@ enum {
 static struct dentry *eventfs_root_lookup(struct inode *dir,
  struct dentry *dentry,
  unsigned int flags);
+static int eventfs_dir_open(struct inode *inode, struct file *file);
 static int eventfs_iterate(struct file *file, struct dir_context *ctx);
+static int eventfs_dir_release(struct inode *inode, struct file *file);
 
 static void update_attr(struct eventfs_attr *attr, struct iattr *iattr)
 {
@@ -212,7 +214,9 @@ static const struct inode_operations 
eventfs_file_inode_operations = {
 
 static const struct file_operations eventfs_file_operations = {
.read   = generic_read_dir,
+   .open   = eventfs_dir_open,
.iterate_shared = eventfs_iterate,
+   .release= eventfs_dir_release,
.llseek = generic_file_llseek,
 };
 
@@ -706,34 +710,71 @@ static struct dentry *eventfs_root_lo

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-12 Thread Steven Rostedt

On Fri, 12 Jan 2024 08:53:44 -0500
Steven Rostedt  wrote:

> > // We managed to open the directory so we have permission to list
> > // directory entries in "xfs".
> > fd = open("/sys/kernel/tracing/events/xfs");
> > 
> > // Remove ownership so we can't open the directory anymore
> > chown("/sys/kernel/tracing/events/xfs", 0, 0);
> > 
> > // Or just remove exec bit for the group and restrict to owner
> > chmod("/sys/kernel/tracing/events/xfs", 700);
> > 
> > // Drop caches to force an eventfs_root_lookup() on everything
> > write("/proc/sys/vm/drop_caches", "3", 1);  
> 
> This requires opening the directory, then having it's permissions
> change, and then immediately dropping the caches.
> 
> > 
> > // Returns 0 even though directory has a lot of entries and we should be
> > // able to list them
> > getdents64(fd, ...);  
> 
> And do we care?

Hmm, maybe the issue you have is with the inconsistency of the action?

For this to fail, it would require the admin to do both change the
permission and to flush caches. If you don't flush the caches then the
task with the dir open can still read it regardless. If the dentries
were already created.

In that case I'm fine if we change the creation of the dentries to not
check the permission.

But for now, it's just a weird side effect that I don't really see how
it would affect any user's workflow.

-- Steve

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-12 Thread Steven Rostedt

On Fri, 12 Jan 2024 09:27:24 +0100
Christian Brauner  wrote:

> On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote:
> > On Thu, 11 Jan 2024 22:01:32 +0100
> > Christian Brauner  wrote:
> >   
> > > What I'm pointing out in the current logic is that the caller is
> > > taxed twice:
> > > 
> > > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
> > > (2) And again when you call lookup_one_len() in eventfs_start_creating()
> > > _because_ the permission check in lookup_one_len() is the exact
> > > same permission check again that the vfs has done
> > > inode_permission(MAY_EXEC, "xfs").  
> > 
> > As I described in: 
> > https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/
> > 
> > The eventfs files below "events" doesn't need the .permissions callback at
> > all. It's only there because the "events" inode uses it.
> > 
> > The .permissions call for eventfs has:  
> 
> It doesn't matter whether there's a ->permission handler. If you don't
> add one explicitly the VFS will simply call generic_permission():
> 
> inode_permission()
> -> do_inode_permission()  
>{
> if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
> if (likely(inode->i_op->permission))
> return inode->i_op->permission(idmap, inode, mask);
>
> /* This gets set once for the inode lifetime */
> spin_lock(>i_lock);
> inode->i_opflags |= IOP_FASTPERM;
> spin_unlock(>i_lock);
> }
> return generic_permission(idmap, inode, mask);
>}

Yes I know that, because that's where I knew what to call in the non
"events" dir case.

> 
> > Anyway, the issue is with "events" directory and remounting, because like
> > the tracefs system, the inode and dentry for "evnets" is created at boot
> > up, before the mount happens. The VFS layer is going to check the
> > permissions of its inode and dentry, which will be incorrect if the mount
> > was mounted with a "gid" option.  
> 
> The gid option has nothing to do with this and it is just handled fine
> if you remove the second permission checking in (2).

I guess I'm confused to what you are having an issue with. Is it just
that the permission check gets called twice?

> 
> You need to remove the inode_permission() code from
> eventfs_start_creating(). It is just an internal lookup and the fact
> that you have it in there allows userspace to break readdir on the
> eventfs portions of tracefs as I've shown in the parts of the mail that
> you cut off.

That's because I didn't see how it was related to the way I fixed the
mount=gid issue. Are you only concerned because of the check in
eventfs_start_creating()?

Yes, you posted code that would make things act funny for some code
that I see no real world use case for. Yeah, it may not act "properly"
but I'm not sure that's bad.

Here, I'll paste it back:

> // We managed to open the directory so we have permission to list
> // directory entries in "xfs".
> fd = open("/sys/kernel/tracing/events/xfs");
> 
> // Remove ownership so we can't open the directory anymore
> chown("/sys/kernel/tracing/events/xfs", 0, 0);
> 
> // Or just remove exec bit for the group and restrict to owner
> chmod("/sys/kernel/tracing/events/xfs", 700);
> 
> // Drop caches to force an eventfs_root_lookup() on everything
> write("/proc/sys/vm/drop_caches", "3", 1);

This requires opening the directory, then having it's permissions
change, and then immediately dropping the caches.

> 
> // Returns 0 even though directory has a lot of entries and we should be
> // able to list them
> getdents64(fd, ...);

And do we care?

Since tracing exposes internal kernel information, perhaps this is a
feature and not a bug. If someone who had access to the tracing system
and you wanted to stop them, if they had a directory open that they no
longer have access to, you don't want them to see what's left in the
directory.

In other words, I like the idea that the getends64(fd, ...) will fail!

If there's a file underneath that wasn't change, and the admin thought
that just keeping the top directory permissions off is good enough,
then that attacker having that directory open before the directory had
it's file permissions change is a way to still have access to the files
below it.

> 
> And the failure is in the inode_permission(MAY_EXEC, "xfs") call in
> lookup_one_len() in eventfs_start_creating() which now fails.

And I think is a good thing!

Again, tracefs is special. It gives you access and possibly control to
the kernel behavior. I like the fact that as soon as someone loses
permission to a directory, they immediately lose it.

-- Steve

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-12 Thread Christian Brauner

On Thu, Jan 11, 2024 at 04:53:19PM -0500, Steven Rostedt wrote:
> On Thu, 11 Jan 2024 22:01:32 +0100
> Christian Brauner  wrote:
> 
> > What I'm pointing out in the current logic is that the caller is
> > taxed twice:
> > 
> > (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
> > (2) And again when you call lookup_one_len() in eventfs_start_creating()
> > _because_ the permission check in lookup_one_len() is the exact
> > same permission check again that the vfs has done
> > inode_permission(MAY_EXEC, "xfs").
> 
> As I described in: 
> https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/
> 
> The eventfs files below "events" doesn't need the .permissions callback at
> all. It's only there because the "events" inode uses it.
> 
> The .permissions call for eventfs has:

It doesn't matter whether there's a ->permission handler. If you don't
add one explicitly the VFS will simply call generic_permission():

inode_permission()
-> do_inode_permission()
   {
if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
if (likely(inode->i_op->permission))
return inode->i_op->permission(idmap, inode, mask);
   
/* This gets set once for the inode lifetime */
spin_lock(>i_lock);
inode->i_opflags |= IOP_FASTPERM;
spin_unlock(>i_lock);
}
return generic_permission(idmap, inode, mask);
   }

> Anyway, the issue is with "events" directory and remounting, because like
> the tracefs system, the inode and dentry for "evnets" is created at boot
> up, before the mount happens. The VFS layer is going to check the
> permissions of its inode and dentry, which will be incorrect if the mount
> was mounted with a "gid" option.

The gid option has nothing to do with this and it is just handled fine
if you remove the second permission checking in (2).

You need to remove the inode_permission() code from
eventfs_start_creating(). It is just an internal lookup and the fact
that you have it in there allows userspace to break readdir on the
eventfs portions of tracefs as I've shown in the parts of the mail that
you cut off.

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-11 Thread Steven Rostedt

On Thu, 11 Jan 2024 22:01:32 +0100
Christian Brauner  wrote:

> What I'm pointing out in the current logic is that the caller is
> taxed twice:
> 
> (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
> (2) And again when you call lookup_one_len() in eventfs_start_creating()
> _because_ the permission check in lookup_one_len() is the exact
> same permission check again that the vfs has done
> inode_permission(MAY_EXEC, "xfs").

As I described in: 
https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/

The eventfs files below "events" doesn't need the .permissions callback at
all. It's only there because the "events" inode uses it.

The .permissions call for eventfs has:

static int eventfs_permission(struct mnt_idmap *idmap,
  struct inode *inode, int mask)
{
set_top_events_ownership(inode);
return generic_permission(idmap, inode, mask);
}

Where the "set_top_events_ownership() is a nop for everything but the
"events" directory.

I guess I could have two ops:

static const struct inode_operations eventfs_root_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
.permission = eventfs_permission,
};

static const struct inode_operations eventfs_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
};

And use the second one for all dentries below the root, but I figured it's
not that big of a deal if I called the permissions on all. Perhaps I should
do it with two?

Anyway, the issue is with "events" directory and remounting, because like
the tracefs system, the inode and dentry for "evnets" is created at boot
up, before the mount happens. The VFS layer is going to check the
permissions of its inode and dentry, which will be incorrect if the mount
was mounted with a "gid" option.

-- Steve

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-11 Thread Christian Brauner

On Wed, Jan 10, 2024 at 08:07:46AM -0500, Steven Rostedt wrote:
> On Wed, 10 Jan 2024 12:45:36 +0100
> Christian Brauner  wrote:
> 
> > So say you do:
> > 
> > mkdir /sys/kernel/tracing/instances/foo
> > 
> > After this has returned we know everything we need to know about the new
> > tracefs instance including the ownership and the mode of all inodes in
> > /sys/kernel/tracing/instances/foo/events/* and below precisely because
> > ownership is always inherited from the parent dentry and recorded in the
> > metadata struct eventfs_inode.
> > 
> > So say someone does:
> > 
> > open("/sys/kernel/tracing/instances/foo/events/xfs");
> > 
> > and say this is the first time that someone accesses that events/
> > directory.
> > 
> > When the open pathwalk is done, the vfs will determine via
> > 
> > [1] may_lookup(inode_of(events))
> > 
> > whether you are able to list entries such as "xfs" in that directory.
> > The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds
> > it ends up calling i_op->eventfs_root_lookup(events).
> > 
> > At this point tracefs/eventfs adds the inodes for all entries in that
> > "events" directory including "xfs" based on the metadata it recorded
> > during the mkdir. Since now someone is actually interested in them. And
> > it initializes the inodes with ownership and everything and adds the
> > dentries that belong into that directory.
> > 
> > Nothing here depends on the permissions of the caller. The only
> > permission that mattered was done in the VFS in [1]. If the caller has
> > permissions to enter a directory they can lookup and list its contents.
> > And its contents where determined/fixed etc when mkdir was called.
> > 
> > So we just need to add the required objects into the caches (inode,
> > dentry) whose addition we intentionally defered until someone actually
> > needed them.
> > 
> > So, eventfs_root_lookup() now initializes the inodes with the ownership
> > from the stored metadata or from the parent dentry and splices in inodes
> > and dentries. No permission checking is needed for this because it is
> > always a recheck of what the vfs did in [1].
> > 
> > We now return to the vfs and path walk continues to the final component
> > that you actually want to open which is that "xfs" directory in this
> > example. We check the permissions on that inode via may_open("xfs") and
> > we open that directory returning an fd to userspace ultimately.
> > 
> > (I'm going by memory since I need to step out the door.)
> 
> So, let's say we do:
> 
>  chgrp -R rostedt /sys/kernel/tracing/

The rostedt group needs exec permissions and "other" cannot have exec
permissions otherwise you can trivially list the entries even if it's
owned by root:

chmod 750 /sys/kernel/tracing

user1@localhost:~$ ls -aln /sys/kernel/ | grep tracing
drwxr-x---   6 0 10000 Jan 11 18:23 tracing

> 
> But I don't want rostedt to have access to xfs
> 
>  chgrp -R root /sys/kernel/tracing/events/xfs

chmod 750 /sys/kernel/tracing/events/xfs

user1@localhost:~$ ls -aln /sys/kernel/tracing/events/ | grep xfs
drwxr-x--- 601 0 0 0 Jan 11 18:24 xfs

This ensure that if a user is in the group and the group has exec perms
lookup is possible (For root this will usually work because
CAP_DAC_READ_SEARCH overrides the exec requirement.).

> 
> Both actions will create the inodes and dentries of all files and
> directories (because of "-R"). But once that is done, the ref counts go to
> zero. They stay around until reclaim. But then I open Chrome ;-) and it
> reclaims all the dentries and inodes, so we are back to here we were on
> boot.
> 
> Now as rostedt I do:
> 
>  ls /sys/kernel/tracing/events/xfs
> 
> The VFS layer doesn't know if I have permission to that or not, because all
> the inodes and dentries have been freed. It has to call back to eventfs to
> find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will
> recreated the inodes with the proper permission.

Very roughly, ignoring most of the complexity of lookup and focussing on
the permission checking:

When a caller looks up an entry in a directory then the VFS will call
inode_permission(MAY_EXEC) on the directory the caller is trying to
perform that lookup in.

If the caller wants to lookup the "events" entry in the "tracing"
directory then the VFS will call inode_permission(MAY_EXEC, "tracing")
and then - assuming it's not in the cache - call into the lookup method
of the filesystem.

After the VFS has determin

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Steven Rostedt

On Wed, 10 Jan 2024 10:52:51 -0500
Steven Rostedt  wrote:

> On Wed, 10 Jan 2024 08:07:46 -0500
> Steven Rostedt  wrote:
> 
> > Or are you saying that I don't need the ".permission" callback, because
> > eventfs does it when it creates the inodes? But for eventfs to know what
> > the permissions changes are, it uses .getattr and .setattr.  
> 
> OK, if your main argument is that we do not need .permission, I agree with
> you. But that's a trivial change and doesn't affect the complexity that
> eventfs is doing. In fact, removing the "permission" check is simply this
> patch:
> 
> --
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index fdff53d5a1f8..f2af07a857e2 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap,
>   return 0;
>  }
>  
> -static int eventfs_permission(struct mnt_idmap *idmap,
> -   struct inode *inode, int mask)
> -{
> - set_top_events_ownership(inode);
> - return generic_permission(idmap, inode, mask);
> -}
> -
>  static const struct inode_operations eventfs_root_dir_inode_operations = {
>   .lookup = eventfs_root_lookup,
>   .setattr= eventfs_set_attr,
>   .getattr= eventfs_get_attr,
> - .permission = eventfs_permission,
>  };
>  
>  static const struct inode_operations eventfs_file_inode_operations = {
> --
> 
> I only did that because Linus mentioned it, and I thought it was needed.
> I'll apply this patch too, as it appears to work with this code.

Oh, eventfs files and directories don't need the .permissions because its
inodes and dentries are not created until accessed. But the "events"
directory itself has its dentry and inode created at boot up, but still
uses the eventfs_root_dir_inode_operations. So the .permissions is still
needed!

If you look at the "set_top_events_ownership()" function, it has:

/* The top events directory doesn't get automatically updated */
if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
return;

That is, it does nothing if the entry is not the "events" directory. It
falls back to he default "->permissions()" function for everything but the
top level "events" directory.

But this and .getattr are still needed for the events directory, because it
suffers the same issue as the other tracefs entries. That is, it's inodes
and dentries are created at boot up before it is mounted. So if the mount
has gid=1000, it will be ignored.

The .getattr is called by "stat" which ls does. So after boot up if you
just do:

 # chmod 0750 /sys/kernel/events
 # chmod 0770 /sys/kernel/tracing
 # mount -o remount,gid=1000 /sys/kernel/tracing
 # su - rostedt
 $ id
uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt)
 $ ls /sys/kernel/tracing/events/
9pext4iomapmodule  raw_syscalls  thermal
alarmtimerfib iommumsr rcu   thp
avc   fib6io_uring napiregmaptimer
block filelockipi  neigh   regulator tlb
bpf_test_run  filemap irq  net resctrl   udp
bpf_trace ftrace  irq_matrix   netfs   rpm   virtio_gpu[
...]

The above works because "ls" does a stat() on the directory first, which
does a .getattr() call that updates the permissions of the existing "events"
directory inode.

  BUT!

If I had used my own getents() program that has:

fd = openat(AT_FDCWD, argv[1], O_RDONLY);
if (fd < 0)
perror("openat");

n = getdents64(fd, buf, BUF_SIZE);
if (n < 0)
perror("getdents64");

Where it calls the openat() without doing a stat fist, and after boot, had done:

 # chmod 0750 /sys/kernel/events
 # chmod 0770 /sys/kernel/tracing
 # mount -o remount,gid=1000 /sys/kernel/tracing
 # su - rostedt
 $ id
uid=1000(rostedt) gid=1000(rostedt) groups=1000(rostedt)
 $ ./getdents /sys/kernel/tracing/events
openat: Permission denied
getdents64: Bad file descriptor

It errors because he "events" inode permission hasn't been updated yet.
Now after getting the above error, if I do the "ls" and then run it again:

 $ ls /sys/kernel/tracing/events > /dev/null
 $ ./getdents /sys/kernel/tracing/events
enable
header_page
header_event
initcall
vsyscall
syscalls

it works!

so no, I can't remove that .permissions callback from eventfs.

-- Steve

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Steven Rostedt

On Wed, 10 Jan 2024 10:52:51 -0500
Steven Rostedt  wrote:

> I'll apply this patch too, as it appears to work with this code.

I meant "appears to work without this code".

-- Steve

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Steven Rostedt

On Wed, 10 Jan 2024 08:07:46 -0500
Steven Rostedt  wrote:

> Or are you saying that I don't need the ".permission" callback, because
> eventfs does it when it creates the inodes? But for eventfs to know what
> the permissions changes are, it uses .getattr and .setattr.

OK, if your main argument is that we do not need .permission, I agree with
you. But that's a trivial change and doesn't affect the complexity that
eventfs is doing. In fact, removing the "permission" check is simply this
patch:

--
diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index fdff53d5a1f8..f2af07a857e2 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -192,18 +192,10 @@ static int eventfs_get_attr(struct mnt_idmap *idmap,
return 0;
 }
 
-static int eventfs_permission(struct mnt_idmap *idmap,
- struct inode *inode, int mask)
-{
-   set_top_events_ownership(inode);
-   return generic_permission(idmap, inode, mask);
-}
-
 static const struct inode_operations eventfs_root_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
-   .permission = eventfs_permission,
 };
 
 static const struct inode_operations eventfs_file_inode_operations = {
--

I only did that because Linus mentioned it, and I thought it was needed.
I'll apply this patch too, as it appears to work with this code.

Thanks!

-- Steve

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Steven Rostedt

On Wed, 10 Jan 2024 12:45:36 +0100
Christian Brauner  wrote:

> So say you do:
> 
> mkdir /sys/kernel/tracing/instances/foo
> 
> After this has returned we know everything we need to know about the new
> tracefs instance including the ownership and the mode of all inodes in
> /sys/kernel/tracing/instances/foo/events/* and below precisely because
> ownership is always inherited from the parent dentry and recorded in the
> metadata struct eventfs_inode.
> 
> So say someone does:
> 
> open("/sys/kernel/tracing/instances/foo/events/xfs");
> 
> and say this is the first time that someone accesses that events/
> directory.
> 
> When the open pathwalk is done, the vfs will determine via
> 
> [1] may_lookup(inode_of(events))
> 
> whether you are able to list entries such as "xfs" in that directory.
> The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds
> it ends up calling i_op->eventfs_root_lookup(events).
> 
> At this point tracefs/eventfs adds the inodes for all entries in that
> "events" directory including "xfs" based on the metadata it recorded
> during the mkdir. Since now someone is actually interested in them. And
> it initializes the inodes with ownership and everything and adds the
> dentries that belong into that directory.
> 
> Nothing here depends on the permissions of the caller. The only
> permission that mattered was done in the VFS in [1]. If the caller has
> permissions to enter a directory they can lookup and list its contents.
> And its contents where determined/fixed etc when mkdir was called.
> 
> So we just need to add the required objects into the caches (inode,
> dentry) whose addition we intentionally defered until someone actually
> needed them.
> 
> So, eventfs_root_lookup() now initializes the inodes with the ownership
> from the stored metadata or from the parent dentry and splices in inodes
> and dentries. No permission checking is needed for this because it is
> always a recheck of what the vfs did in [1].
> 
> We now return to the vfs and path walk continues to the final component
> that you actually want to open which is that "xfs" directory in this
> example. We check the permissions on that inode via may_open("xfs") and
> we open that directory returning an fd to userspace ultimately.
> 
> (I'm going by memory since I need to step out the door.)

So, let's say we do:

 chgrp -R rostedt /sys/kernel/tracing/

But I don't want rostedt to have access to xfs

 chgrp -R root /sys/kernel/tracing/events/xfs

Both actions will create the inodes and dentries of all files and
directories (because of "-R"). But once that is done, the ref counts go to
zero. They stay around until reclaim. But then I open Chrome ;-) and it
reclaims all the dentries and inodes, so we are back to here we were on
boot.

Now as rostedt I do:

 ls /sys/kernel/tracing/events/xfs

The VFS layer doesn't know if I have permission to that or not, because all
the inodes and dentries have been freed. It has to call back to eventfs to
find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will
recreated the inodes with the proper permission.

Or are you saying that I don't need the ".permission" callback, because
eventfs does it when it creates the inodes? But for eventfs to know what
the permissions changes are, it uses .getattr and .setattr.

-- Steve

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-10 Thread Christian Brauner

On Mon, Jan 08, 2024 at 10:23:31AM -0500, Steven Rostedt wrote:
> On Mon, 8 Jan 2024 12:04:54 +0100
> Christian Brauner  wrote:
> 
> > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > > redundant and just wrong.  
> > > 
> > > I don't think so.  
> > 
> > I'm very well aware that the dentries and inode aren't created during
> > mkdir but the completely directory layout is determined. You're just
> > splicing in dentries and inodes during lookup and readdir.
> > 
> > If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later
> > do a lookup/readdir on
> > 
> > ls -al /sys/kernel/tracing/instances/foo/events
> > 
> > Why should the creation of the dentries and inodes ever fail due to a
> > permission failure?
> 
> They shouldn't.
> 
> > The vfs did already verify that you had the required
> > permissions to list entries in that directory. Why should filling up
> > /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't
> > That tracefs instance would be half-functional. And again, right now
> > that inode_permission() check cannot even fail.
> 
> And it shouldn't. But without dentries and inodes, how does VFS know what
> is allowed to open the files?

So say you do:

mkdir /sys/kernel/tracing/instances/foo

After this has returned we know everything we need to know about the new
tracefs instance including the ownership and the mode of all inodes in
/sys/kernel/tracing/instances/foo/events/* and below precisely because
ownership is always inherited from the parent dentry and recorded in the
metadata struct eventfs_inode.

So say someone does:

open("/sys/kernel/tracing/instances/foo/events/xfs");

and say this is the first time that someone accesses that events/
directory.

When the open pathwalk is done, the vfs will determine via

[1] may_lookup(inode_of(events))

whether you are able to list entries such as "xfs" in that directory.
The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds
it ends up calling i_op->eventfs_root_lookup(events).

At this point tracefs/eventfs adds the inodes for all entries in that
"events" directory including "xfs" based on the metadata it recorded
during the mkdir. Since now someone is actually interested in them. And
it initializes the inodes with ownership and everything and adds the
dentries that belong into that directory.

Nothing here depends on the permissions of the caller. The only
permission that mattered was done in the VFS in [1]. If the caller has
permissions to enter a directory they can lookup and list its contents.
And its contents where determined/fixed etc when mkdir was called.

So we just need to add the required objects into the caches (inode,
dentry) whose addition we intentionally defered until someone actually
needed them.

So, eventfs_root_lookup() now initializes the inodes with the ownership
from the stored metadata or from the parent dentry and splices in inodes
and dentries. No permission checking is needed for this because it is
always a recheck of what the vfs did in [1].

We now return to the vfs and path walk continues to the final component
that you actually want to open which is that "xfs" directory in this
example. We check the permissions on that inode via may_open("xfs") and
we open that directory returning an fd to userspace ultimately.

(I'm going by memory since I need to step out the door.)

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-08 Thread Steven Rostedt

On Mon, 8 Jan 2024 12:32:46 +0100
Christian Brauner  wrote:

> On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote:
> > On Sun, 7 Jan 2024 13:29:12 -0500
> > Steven Rostedt  wrote:
> >   
> > > > 
> > > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > > redundant and just wrong.
> > > 
> > > I don't think so.  
> > 
> > Just to make it clear. eventfs has nothing to do with mkdir instance/foo.
> > It exists without that. Although one rationale to do eventfs was so  
> 
> Every instance/foo/ tracefs instances also contains an events directory
> and thus a eventfs portion. Eventfs is just a subtree of tracefs. It's
> not a separate filesystem. Both eventfs and tracefs are on the same
> single, system wide superblock.
> 
> > that the instance directories wouldn't recreate the same 10thousands
> > event inodes and dentries for every mkdir done.  
> 
> I know but that's irrelevant to what I'm trying to tell you.
> 
> A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs
> instance. With or without the on-demand dentry and inode creation for
> the eventfs portion that tracefs "instance" has now been created in its
> entirety including all the required information for someone to later
> come along and perform a lookup on /sys/kernel/tracing/instances/foo/events.
> 
> All you've done is to defer the addition of the dentries and inodes when
> someone does actually look at the events directory of the tracefs
> instance.
> 
> Whether you choose to splice in the dentries and inodes for the eventfs
> portion during lookup and readdir or if you had chosen to not do the
> on-demand thing at all and the entries were created at the same time as
> the mkdir call are equivalent from the perspective of permission
> checking.
> 
> If you have the required permissions to look at the events directory
> then there's no reason why listing the directory entries in there should
> fail. This can't even happen right now.

Ah, I think I know where the confusion lies. The tracing information in
kernel/trace/*.c doesn't keep track of permission. It relies totally on
fs/tracefs/* to do so. If someone does 'chmod' or 'chown' or mounts with
'gid=xxx' then it's up to tracefs to maintain that information and not the
tracing subsystem. The tracing subsystem only gives the "default"
permissions (before boot finishes).

The difference between normal file systems and pseudo file systems like
debugfs and tracefs, is that normal file systems keep the permission
information stored on the external device. That is, when the inodes and
dentries are created, the information is retrieved from the stored file
system.

I think this may actually be a failure of debugfs (and tracefs as it was
based on debugfs), in that the inodes and dentries are created at the same
time the "files" backing them are. Which is normally at boot up and before
the file system is mounted.

That is, inodes and dentries are actually coupled with the data they
represent. It's not a cache for a back store like a hard drive partition.

To create a file in debugfs you do:

struct dentry *debugfs_create_file(const char *name, umode_t mode,
   struct dentry *parent, void *data,
   const struct file_operations *fops)

That is, you pass a the name, the mode, the parent dentry, data, and the
fops and that will create an inode and dentry (which is returned).

This happens at boot up before user space is running and before debugfs is
even mounted.

Because debugfs is mostly for debugging, people don't care about how it's
mounted. It is usually restricted to root only access. Especially since
there's a lot of sensitive information that shouldn't be exposed to
non-privileged users.

The reason tracefs came about is that people asked me to be able to have
access to tracing without needing to even enable debugfs. They also want to
easily make it accessible to non root users and talking with Kees Cook, he
recommended using ACL. But because it inherited a lot from debugfs, I
started doing these tricks like walking the dentry tree to make it work a
bit better. Because the dentries and inodes were created before mount, I
had to play these tricks.

But as Linus pointed out, that was the wrong way to do that. The right way
was to use .getattr and .permission callbacks to figure out what the
permissions to the files are.

This has nothing to do with the creation of the files, it's about who has
access to the files that the inodes point to.

This sounds like another topic to bring up at LSFMM ;-)  "Can we
standardize pseudo file systems like debugfs and tracefs to act more like
real file systems, and have inodes and dentries act as cache and not be so
coupled to the data?"

-- Steve

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-08 Thread Steven Rostedt

On Mon, 8 Jan 2024 12:04:54 +0100
Christian Brauner  wrote:

> > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > redundant and just wrong.  
> > 
> > I don't think so.  
> 
> I'm very well aware that the dentries and inode aren't created during
> mkdir but the completely directory layout is determined. You're just
> splicing in dentries and inodes during lookup and readdir.
> 
> If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later
> do a lookup/readdir on
> 
> ls -al /sys/kernel/tracing/instances/foo/events
> 
> Why should the creation of the dentries and inodes ever fail due to a
> permission failure?

They shouldn't.

> The vfs did already verify that you had the required
> permissions to list entries in that directory. Why should filling up
> /sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't
> That tracefs instance would be half-functional. And again, right now
> that inode_permission() check cannot even fail.

And it shouldn't. But without dentries and inodes, how does VFS know what
is allowed to open the files?

-- Steve

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-08 Thread Christian Brauner

On Sun, Jan 07, 2024 at 01:32:28PM -0500, Steven Rostedt wrote:
> On Sun, 7 Jan 2024 13:29:12 -0500
> Steven Rostedt  wrote:
> 
> > > 
> > > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > > redundant and just wrong.  
> > 
> > I don't think so.
> 
> Just to make it clear. eventfs has nothing to do with mkdir instance/foo.
> It exists without that. Although one rationale to do eventfs was so

Every instance/foo/ tracefs instances also contains an events directory
and thus a eventfs portion. Eventfs is just a subtree of tracefs. It's
not a separate filesystem. Both eventfs and tracefs are on the same
single, system wide superblock.

> that the instance directories wouldn't recreate the same 10thousands
> event inodes and dentries for every mkdir done.

I know but that's irrelevant to what I'm trying to tell you.

A mkdir /sys/kernel/tracing/instances/foo creates a new tracefs
instance. With or without the on-demand dentry and inode creation for
the eventfs portion that tracefs "instance" has now been created in its
entirety including all the required information for someone to later
come along and perform a lookup on /sys/kernel/tracing/instances/foo/events.

All you've done is to defer the addition of the dentries and inodes when
someone does actually look at the events directory of the tracefs
instance.

Whether you choose to splice in the dentries and inodes for the eventfs
portion during lookup and readdir or if you had chosen to not do the
on-demand thing at all and the entries were created at the same time as
the mkdir call are equivalent from the perspective of permission
checking.

If you have the required permissions to look at the events directory
then there's no reason why listing the directory entries in there should
fail. This can't even happen right now.

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-08 Thread Christian Brauner

> > * Tracefs supports the creation of instances from userspace via mkdir.
> >   For example,
> > 
> > mkdir /sys/kernel/tracing/instances/foo
> > 
> >   And here the idmapping is relevant so we need to make the helpers
> >   aware of the idmapping.
> > 
> >   I just went and plumbed this through to most helpers.
> > 
> > There's some subtlety in eventfs. Afaict, the directories and files for
> > the individual events are created on-demand during lookup or readdir.
> > 
> > The ownership of these events is again inherited from the parent inode
> > or recovered from stored state. In both cases the actual idmapping is
> > irrelevant.
> > 
> > The callchain here is:
> > 
> > eventfs_root_lookup("xfs", "events")
> > -> create_{dir,file}_dentry("xfs", "events")
> >-> create_{dir,file}("xfs", "events")
> >   -> eventfs_start_creating("xfs", "events")
> >  -> lookup_one_len("xfs", "events")  
> > 
> > And the subtlety is that lookup_one_len() does permission checking on
> > the parent inode (IOW, if you want a dentry for "blech" under "events"
> > it'll do a permission check on events->d_inode) for exec permissions
> > and then goes on to give you a new dentry.
> > 
> > Usually this call would have to be changed to lookup_one() and the
> > idmapping be handed down to it. But I think that's irrelevant here.
> > 
> > Lookup generally doesn't need to be aware of idmappings at all. The
> > permission checking is done purely in the vfs via may_lookup() and the
> > idmapping is irrelevant because we always initialize inodes with the
> > filesystem level ownership (see the idmappings.rst) documentation if
> > you're interested in excessive details (otherwise you get inode aliases
> > which you really don't want).
> > 
> > For tracefs it would not matter for lookup per se but only because
> > tracefs seemingly creates inodes/dentries during lookup (and readdir()).
> 
> tracefs creates the inodes/dentries at boot up, it's eventfs that does
> it on demand during lookup.
> 
> For inodes/dentries:
> 
>  /sys/kernel/tracing/* is all created at boot up, except for "events".

Yes.

>  /sys/kernel/tracing/events/* is created on demand.

Yes.

> 
> > 
> > But imho the permission checking done in current eventfs_root_lookup()
> > via lookup_one_len() is meaningless in any way; possibly even
> > (conceptually) wrong.
> > 
> > Because, the actual permission checking for the creation of the eventfs
> > entries isn't really done during lookup or readdir, it's done when mkdir
> > is called:
> > 
> > mkdir /sys/kernel/tracing/instances/foo
> 
> No. that creates a entire new tracefs instance, which happens to
> include another eventfs directory.

Yes, I'm aware of all that.

> No. Only the meta data is created for the eventfs directory with a
> mkdir instances/foo. The inodes and dentries are not there.

I know, that is what I'm saying.

> 
> > 
> > When one goes and looksup stuff under foo/events/ or readdir the entries
> > in that directory:
> > 
> > fd = open("foo/events")
> > readdir(fd, ...)
> > 
> > then they are licensed to list an entry in that directory. So all that
> > needs to be done is to actually list those files in that directory. And
> > since they already exist (they were created during mkdir) we just need
> > to splice in inodes and dentries for them. But for that we shouldn't
> > check permissions on the directory again. Because we've done that
> > already correctly when the VFS called may_lookup().
> 
> No they do not exist.

I am aware.

> 
> > 
> > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > redundant and just wrong.
> 
> I don't think so.

I'm very well aware that the dentries and inode aren't created during
mkdir but the completely directory layout is determined. You're just
splicing in dentries and inodes during lookup and readdir.

If mkdir /sys/kernel/tracing/instances/foo has succeeded and you later
do a lookup/readdir on

ls -al /sys/kernel/tracing/instances/foo/events

Why should the creation of the dentries and inodes ever fail due to a
permission failure? The vfs did already verify that you had the required
permissions to list entries in that directory. Why should filling up
/sys/kernel/tracing/instances/foo/events ever fail then? It shouldn't
That tracefs instance would be half-functional. And again, right now
that inode_permission() check cannot even fail.

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-07 Thread Steven Rostedt

On Sun, 7 Jan 2024 13:29:12 -0500
Steven Rostedt  wrote:

> > 
> > IOW, the inode_permission() in lookup_one_len() that eventfs does is
> > redundant and just wrong.  
> 
> I don't think so.

Just to make it clear. eventfs has nothing to do with mkdir instance/foo.
It exists without that. Although one rationale to do eventfs was so
that the instance directories wouldn't recreate the same 10thousands
event inodes and dentries for every mkdir done.

-- Steve

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-07 Thread Steven Rostedt

On Sun, 7 Jan 2024 13:42:39 +0100
Christian Brauner  wrote:
> 
> So, I tried to do an exploratory patch even though I promised myself not
> to do it. But hey...
> 
> Some notes:
> 
> * Permission handling for idmapped mounts is done completely in the
>   VFS. That's the case for all filesytems that don't have a custom
>   ->permission() handler. So there's nothing to do for us here.  
> 
> * Idmapped mount handling for ->getattr() is done completely by the VFS
>   if the filesystem doesn't have a custom ->getattr() handler. So we're
>   done here.
> 
> * Tracefs doesn't support attribute changes via ->setattr() (chown(),
>   chmod etc.). So there's nothing to here.
> 
> * Eventfs does support attribute changes via ->setattr(). But it relies
>   on simple_setattr() which is already idmapped mount aware. So there's
>   nothing for us to do.
> 
> * Ownership is inherited from the parent inode (tracefs) or optionally
>   from stashed ownership information (eventfs). That means the idmapping
>   is irrelevant here. It's similar to the "inherit gid from parent
>   directory" logic we have in some circumstances. TL;DR nothing to do
>   here as well.

The reason ownership is inherited from the parent is because the inodes
are created at boot up before user space starts.

eventfs inodes are created on demand after user space so it needs to
either check the "default" ownership and permissions, or if the user
changed a specific file/directory, it must save it and use that again
if the inode/dentry are reclaimed and then referenced again and
recreated.

> 
> * Tracefs supports the creation of instances from userspace via mkdir.
>   For example,
> 
>   mkdir /sys/kernel/tracing/instances/foo
> 
>   And here the idmapping is relevant so we need to make the helpers
>   aware of the idmapping.
> 
>   I just went and plumbed this through to most helpers.
> 
> There's some subtlety in eventfs. Afaict, the directories and files for
> the individual events are created on-demand during lookup or readdir.
> 
> The ownership of these events is again inherited from the parent inode
> or recovered from stored state. In both cases the actual idmapping is
> irrelevant.
> 
> The callchain here is:
> 
> eventfs_root_lookup("xfs", "events")
> -> create_{dir,file}_dentry("xfs", "events")
>-> create_{dir,file}("xfs", "events")
>   -> eventfs_start_creating("xfs", "events")
>  -> lookup_one_len("xfs", "events")  
> 
> And the subtlety is that lookup_one_len() does permission checking on
> the parent inode (IOW, if you want a dentry for "blech" under "events"
> it'll do a permission check on events->d_inode) for exec permissions
> and then goes on to give you a new dentry.
> 
> Usually this call would have to be changed to lookup_one() and the
> idmapping be handed down to it. But I think that's irrelevant here.
> 
> Lookup generally doesn't need to be aware of idmappings at all. The
> permission checking is done purely in the vfs via may_lookup() and the
> idmapping is irrelevant because we always initialize inodes with the
> filesystem level ownership (see the idmappings.rst) documentation if
> you're interested in excessive details (otherwise you get inode aliases
> which you really don't want).
> 
> For tracefs it would not matter for lookup per se but only because
> tracefs seemingly creates inodes/dentries during lookup (and readdir()).

tracefs creates the inodes/dentries at boot up, it's eventfs that does
it on demand during lookup.

For inodes/dentries:

 /sys/kernel/tracing/* is all created at boot up, except for "events".
 /sys/kernel/tracing/events/* is created on demand.

> 
> But imho the permission checking done in current eventfs_root_lookup()
> via lookup_one_len() is meaningless in any way; possibly even
> (conceptually) wrong.
> 
> Because, the actual permission checking for the creation of the eventfs
> entries isn't really done during lookup or readdir, it's done when mkdir
> is called:
> 
> mkdir /sys/kernel/tracing/instances/foo

No. that creates a entire new tracefs instance, which happens to
include another eventfs directory.

The eventsfs directory is in "/sys/kernel/tracing/events" and
 "/sys/kernel/tracing/instances/*/events"

eventfs has 10s of thousands of files and directories which is why I
changed it to be on-demand inode/dentry creation and also reclaiming
when no longer accessed.

 # ls /sys/kernel/tracing/events/

will create the inodes and dentries, and a memory stress program will
free those created inodes and dentries.

> 
> Here, all po

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-07 Thread Christian Brauner

On Sun, Jan 07, 2024 at 06:42:33PM +0100, Christian Brauner wrote:
> On Sun, Jan 07, 2024 at 01:42:39PM +0100, Christian Brauner wrote:
> > > > So tracefs supports remounting with different uid/gid mount options and
> > > > then actually wades through _all_ of the inodes and changes their
> > > > ownership internally? What's the use-case for this? Containers?
> > > 
> > > No, in fact tracing doesn't work well with containers as tracing is global
> > > to the entire machine. It can work with privileged containers though.
> > 
> > At least the tracefs interface is easily supportable within a delegation
> > model. IOW, you have a privileged process that delegates relevant
> > portions to a container via idmapped mounts _without_ doing the insane thing
> > and making it mountable by a container aka the fs-to-CVE pipeline.
> > 
> > > 
> > > The reason for this is because tracefs was based off of debugfs where the
> > > files and directores are created at boot up and mounted later. The reason
> > > to do this was to allow users to mount with gid=GID to allow a given group
> > > to have access to tracing. Without this update, tracefs would ignore it
> > > like debugfs and proc does today.
> > > 
> > > I think its time I explain the purpose of tracefs and how it came to be.
> > > 
> > > The tracing system required a way to control tracing and read the traces.
> > > It could have just used a new system like perf (although
> > > /sys/kernel/debug/tracing predates perf), where it created a single 
> > > ioctl()
> > > like system call do do everything.
> > > 
> > > As the ftrace tracing came from PREEMPT_RT latency tracer and my own 
> > > logdev
> > > tracer, which both have an embedded background, I chose an interface that
> > > could work with just an unmodified version of busybox. That is, I wanted 
> > > it
> > > to work with just cat and echo.
> > > 
> > > The main difference with tracefs compared to other file systems is that it
> > > is a control interface, where writes happen as much as reads. The data 
> > > read
> > > is controlled. The closest thing I can think of is how cgroups work.
> > > 
> > > As tracing is a privileged operation, but something that could be changed
> > > to allow a group to have access to, I wanted to make it easy for an admin
> > > to decide who gets to do what at boot up via the /etc/fstab file.
> > 
> > Yeah, ok. I think you could achieve the same thing via idmapped mounts. You
> > just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount.
> > 
> > mount(8) should just give you the ability to specify "map the ids I 
> > explicitly
> > want to remap to something else and for the rest use the identity mapping". 
> > I
> > wanted that for other reasons anyway.
> > 
> > So in one of the next versions of mount(8) you can then do (where --beneath
> > means place the mount beneath the current one and --replace is
> > self-explanatory):
> > 
> > sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' 
> > /sys/kernel/tracing
> > sudo umount /sys/kernel/tracing
> > 
> > or as a shortcut provided by mount(8):
> > 
> > sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' 
> > /sys/kernel/tracing 
> > 
> > In both cases you replace the mount without unmounting tracefs.
> > 
> > I can illustrate this right now though:
> > 
> > user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 
> > u:0:1000:1' /sys/kernel/tracing/ /mnt/
> > 
> > # This is a tool I wrote for testing the patchset I wrote back then.
> > user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath 
> > --detached /mnt /sys/kernel/tracing
> > Mounting beneath top mount
> > Creating anonymous mount
> > Attaching mount /mnt -> /sys/kernel/tracing
> > Creating single detached mount
> > 
> > user1@localhost:~/data/move-mount-beneath$
> > 
> > # Now there's two mounts stacked on top of each other.
> > user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
> > | `-/sys/kernel/tracingtracefstracefs 
> > rw,nosuid,nodev,noexec,relatime,idmapped
> > |   `-/sys/kernel/tracing  tracefstracefs 
> > rw,nosuid,nodev,noexec,relatime
> > 
> > user1@localhost:~/data/move-mount-beneath$ sudo ls -al 
> > /sys/kernel/tracing/| head
> > total 0
> > drwx--  6 root root 0 J

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-07 Thread Christian Brauner

On Sun, Jan 07, 2024 at 01:42:39PM +0100, Christian Brauner wrote:
> > > So tracefs supports remounting with different uid/gid mount options and
> > > then actually wades through _all_ of the inodes and changes their
> > > ownership internally? What's the use-case for this? Containers?
> > 
> > No, in fact tracing doesn't work well with containers as tracing is global
> > to the entire machine. It can work with privileged containers though.
> 
> At least the tracefs interface is easily supportable within a delegation
> model. IOW, you have a privileged process that delegates relevant
> portions to a container via idmapped mounts _without_ doing the insane thing
> and making it mountable by a container aka the fs-to-CVE pipeline.
> 
> > 
> > The reason for this is because tracefs was based off of debugfs where the
> > files and directores are created at boot up and mounted later. The reason
> > to do this was to allow users to mount with gid=GID to allow a given group
> > to have access to tracing. Without this update, tracefs would ignore it
> > like debugfs and proc does today.
> > 
> > I think its time I explain the purpose of tracefs and how it came to be.
> > 
> > The tracing system required a way to control tracing and read the traces.
> > It could have just used a new system like perf (although
> > /sys/kernel/debug/tracing predates perf), where it created a single ioctl()
> > like system call do do everything.
> > 
> > As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev
> > tracer, which both have an embedded background, I chose an interface that
> > could work with just an unmodified version of busybox. That is, I wanted it
> > to work with just cat and echo.
> > 
> > The main difference with tracefs compared to other file systems is that it
> > is a control interface, where writes happen as much as reads. The data read
> > is controlled. The closest thing I can think of is how cgroups work.
> > 
> > As tracing is a privileged operation, but something that could be changed
> > to allow a group to have access to, I wanted to make it easy for an admin
> > to decide who gets to do what at boot up via the /etc/fstab file.
> 
> Yeah, ok. I think you could achieve the same thing via idmapped mounts. You
> just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount.
> 
> mount(8) should just give you the ability to specify "map the ids I explicitly
> want to remap to something else and for the rest use the identity mapping". I
> wanted that for other reasons anyway.
> 
> So in one of the next versions of mount(8) you can then do (where --beneath
> means place the mount beneath the current one and --replace is
> self-explanatory):
> 
> sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing
> sudo umount /sys/kernel/tracing
> 
> or as a shortcut provided by mount(8):
> 
> sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' 
> /sys/kernel/tracing 
> 
> In both cases you replace the mount without unmounting tracefs.
> 
> I can illustrate this right now though:
> 
> user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' 
> /sys/kernel/tracing/ /mnt/
> 
> # This is a tool I wrote for testing the patchset I wrote back then.
> user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath 
> --detached /mnt /sys/kernel/tracing
> Mounting beneath top mount
> Creating anonymous mount
> Attaching mount /mnt -> /sys/kernel/tracing
> Creating single detached mount
> 
> user1@localhost:~/data/move-mount-beneath$
> 
> # Now there's two mounts stacked on top of each other.
> user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
> | `-/sys/kernel/tracingtracefstracefs 
> rw,nosuid,nodev,noexec,relatime,idmapped
> |   `-/sys/kernel/tracing  tracefstracefs 
> rw,nosuid,nodev,noexec,relatime
> 
> user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| 
> head
> total 0
> drwx--  6 root root 0 Jan  7 13:33 .
> drwxr-xr-x 16 root root 0 Jan  7 13:33 ..
> -r--r-  1 root root 0 Jan  7 13:33 README
> -r--r-  1 root root 0 Jan  7 13:33 available_events
> -r--r-  1 root root 0 Jan  7 13:33 available_filter_functions
> -r--r-  1 root root 0 Jan  7 13:33 available_filter_functions_addrs
> -r--r-  1 root root 0 Jan  7 13:33 available_tracers
> -rw-r-  1 root root 0 Jan  7 13:33 buffer_percent
> -rw-r-  1 root root 0 Jan  7 13:33 buffer_size_kb
> 
> # Reveal updated mount
> user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/k

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-07 Thread Christian Brauner

> > So tracefs supports remounting with different uid/gid mount options and
> > then actually wades through _all_ of the inodes and changes their
> > ownership internally? What's the use-case for this? Containers?
> 
> No, in fact tracing doesn't work well with containers as tracing is global
> to the entire machine. It can work with privileged containers though.

At least the tracefs interface is easily supportable within a delegation
model. IOW, you have a privileged process that delegates relevant
portions to a container via idmapped mounts _without_ doing the insane thing
and making it mountable by a container aka the fs-to-CVE pipeline.

> 
> The reason for this is because tracefs was based off of debugfs where the
> files and directores are created at boot up and mounted later. The reason
> to do this was to allow users to mount with gid=GID to allow a given group
> to have access to tracing. Without this update, tracefs would ignore it
> like debugfs and proc does today.
> 
> I think its time I explain the purpose of tracefs and how it came to be.
> 
> The tracing system required a way to control tracing and read the traces.
> It could have just used a new system like perf (although
> /sys/kernel/debug/tracing predates perf), where it created a single ioctl()
> like system call do do everything.
> 
> As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev
> tracer, which both have an embedded background, I chose an interface that
> could work with just an unmodified version of busybox. That is, I wanted it
> to work with just cat and echo.
> 
> The main difference with tracefs compared to other file systems is that it
> is a control interface, where writes happen as much as reads. The data read
> is controlled. The closest thing I can think of is how cgroups work.
> 
> As tracing is a privileged operation, but something that could be changed
> to allow a group to have access to, I wanted to make it easy for an admin
> to decide who gets to do what at boot up via the /etc/fstab file.

Yeah, ok. I think you could achieve the same thing via idmapped mounts. You
just need to swap out the mnt on /sys/kernel/tracing with an idmapped mount.

mount(8) should just give you the ability to specify "map the ids I explicitly
want to remap to something else and for the rest use the identity mapping". I
wanted that for other reasons anyway.

So in one of the next versions of mount(8) you can then do (where --beneath
means place the mount beneath the current one and --replace is
self-explanatory):

sudo mount --beneath -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing
sudo umount /sys/kernel/tracing

or as a shortcut provided by mount(8):

sudo mount --replace -o X-mount.idmap='g:0:1234:1 u:0:0:1' /sys/kernel/tracing 

In both cases you replace the mount without unmounting tracefs.

I can illustrate this right now though:

user1@localhost:~$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1000:1' 
/sys/kernel/tracing/ /mnt/

# This is a tool I wrote for testing the patchset I wrote back then.
user1@localhost:~/data/move-mount-beneath$ sudo ./move-mount --beneath 
--detached /mnt /sys/kernel/tracing
Mounting beneath top mount
Creating anonymous mount
Attaching mount /mnt -> /sys/kernel/tracing
Creating single detached mount

user1@localhost:~/data/move-mount-beneath$

# Now there's two mounts stacked on top of each other.
user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
| `-/sys/kernel/tracingtracefstracefs 
rw,nosuid,nodev,noexec,relatime,idmapped
|   `-/sys/kernel/tracing  tracefstracefs 
rw,nosuid,nodev,noexec,relatime

user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| 
head
total 0
drwx--  6 root root 0 Jan  7 13:33 .
drwxr-xr-x 16 root root 0 Jan  7 13:33 ..
-r--r-  1 root root 0 Jan  7 13:33 README
-r--r-  1 root root 0 Jan  7 13:33 available_events
-r--r-  1 root root 0 Jan  7 13:33 available_filter_functions
-r--r-  1 root root 0 Jan  7 13:33 available_filter_functions_addrs
-r--r-  1 root root 0 Jan  7 13:33 available_tracers
-rw-r-  1 root root 0 Jan  7 13:33 buffer_percent
-rw-r-  1 root root 0 Jan  7 13:33 buffer_size_kb

# Reveal updated mount
user1@localhost:~/data/move-mount-beneath$ sudo umount /sys/kernel/tracing

user1@localhost:~/data/move-mount-beneath$ findmnt | grep tracing
| `-/sys/kernel/tracingtracefstracefs 
rw,nosuid,nodev,noexec,relatime,idmapped

user1@localhost:~/data/move-mount-beneath$ sudo ls -al /sys/kernel/tracing/| 
head
total 0
drwx--  6 user1 user1 0 Jan  7 13:33 .
drwxr-xr-x 16 root  root  0 Jan  7 13:33 ..
-r--r-  1 user1 user1 0 Jan  7 13:33 README
-r--r-  1 user1 user1 0 Jan  7 13:33 available_events
-r--r-  1 user1 user1 0 Jan  7 13:33 available_filter_functions
-r--r-  1 user

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-05 Thread Steven Rostedt

On Fri, 5 Jan 2024 15:26:28 +0100
Christian Brauner  wrote:

> On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > Instead of walking the dentries on mount/remount to update the gid values of
> > all the dentries if a gid option is specified on mount, just update the root
> > inode. Add .getattr, .setattr, and .permissions on the tracefs inode
> > operations to update the permissions of the files and directories.
> > 
> > For all files and directories in the top level instance:
> > 
> >  /sys/kernel/tracing/*
> > 
> > It will use the root inode as the default permissions. The inode that
> > represents: /sys/kernel/tracing (or wherever it is mounted).
> > 
> > When an instance is created:
> > 
> >  mkdir /sys/kernel/tracing/instance/foo
> > 
> > The directory "foo" and all its files and directories underneath will use
> > the default of what foo is when it was created. A remount of tracefs will
> > not affect it.  
> 
> That kinda sounds like eventfs should actually be a separate filesystem.
> But I don't know enough about the relationship between the two concepts.

It may someday become one, as eventfs is used by perf where the rest of the
tracefs system is not.

> 
> > 
> > If a user were to modify the permissions of any file or directory in
> > tracefs, it will also no longer be modified by a change in ownership of a
> > remount.  
> 
> Very odd semantics and I would recommend to avoid that. It's just plain
> weird imo.
> 
> > 
> > The events directory, if it is in the top level instance, will use the
> > tracefs root inode as the default ownership for itself and all the files and
> > directories below it.
> > 
> > For the events directory in an instance ("foo"), it will keep the ownership
> > of what it was when it was created, and that will be used as the default
> > ownership for the files and directories beneath it.
> > 
> > Link: 
> > https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/
> > 
> > Signed-off-by: Steven Rostedt (Google) 
> > ---  
> 
> So tracefs supports remounting with different uid/gid mount options and
> then actually wades through _all_ of the inodes and changes their
> ownership internally? What's the use-case for this? Containers?

No, in fact tracing doesn't work well with containers as tracing is global
to the entire machine. It can work with privileged containers though.

The reason for this is because tracefs was based off of debugfs where the
files and directores are created at boot up and mounted later. The reason
to do this was to allow users to mount with gid=GID to allow a given group
to have access to tracing. Without this update, tracefs would ignore it
like debugfs and proc does today.

I think its time I explain the purpose of tracefs and how it came to be.

The tracing system required a way to control tracing and read the traces.
It could have just used a new system like perf (although
/sys/kernel/debug/tracing predates perf), where it created a single ioctl()
like system call do do everything.

As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev
tracer, which both have an embedded background, I chose an interface that
could work with just an unmodified version of busybox. That is, I wanted it
to work with just cat and echo.

The main difference with tracefs compared to other file systems is that it
is a control interface, where writes happen as much as reads. The data read
is controlled. The closest thing I can think of is how cgroups work.

As tracing is a privileged operation, but something that could be changed
to allow a group to have access to, I wanted to make it easy for an admin
to decide who gets to do what at boot up via the /etc/fstab file.

> 
> Aside from optimizing this and the special semantics for this eventfs
> stuff that you really should think twice of doing, here's one idea for
> an extension that might alleviate some of the pain:
> 
> If you need flexible dynamic ownership change to e.g., be able to
> delegate (all, a directory, a single file of) tracefs to
> unprivileged/containers/whatever then you might want to consider
> supporting idmapped mounts for tracefs. Because then you can do stuff
> like:

I'm a novice here and have no idea on how id maps work ;-)

> 
> user1@localhost:~/data/scripts$ sudo mount --bind -o 
> X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt
> user1@localhost:~/data/scripts$ ls -ln /run/
> total 12
> drwxr-xr-x  2 0  0   40 Jan  5 12:12 credentials
> drwx--  2 0  0   40 Jan  5 11:57 cryptsetup
> drwxr-xr-x  2 0  0   60 Jan

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-05 Thread Christian Brauner

On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> Instead of walking the dentries on mount/remount to update the gid values of
> all the dentries if a gid option is specified on mount, just update the root
> inode. Add .getattr, .setattr, and .permissions on the tracefs inode
> operations to update the permissions of the files and directories.
> 
> For all files and directories in the top level instance:
> 
>  /sys/kernel/tracing/*
> 
> It will use the root inode as the default permissions. The inode that
> represents: /sys/kernel/tracing (or wherever it is mounted).
> 
> When an instance is created:
> 
>  mkdir /sys/kernel/tracing/instance/foo
> 
> The directory "foo" and all its files and directories underneath will use
> the default of what foo is when it was created. A remount of tracefs will
> not affect it.

That kinda sounds like eventfs should actually be a separate filesystem.
But I don't know enough about the relationship between the two concepts.

> 
> If a user were to modify the permissions of any file or directory in
> tracefs, it will also no longer be modified by a change in ownership of a
> remount.

Very odd semantics and I would recommend to avoid that. It's just plain
weird imo.

> 
> The events directory, if it is in the top level instance, will use the
> tracefs root inode as the default ownership for itself and all the files and
> directories below it.
> 
> For the events directory in an instance ("foo"), it will keep the ownership
> of what it was when it was created, and that will be used as the default
> ownership for the files and directories beneath it.
> 
> Link: 
> https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/
> 
> Signed-off-by: Steven Rostedt (Google) 
> ---

So tracefs supports remounting with different uid/gid mount options and
then actually wades through _all_ of the inodes and changes their
ownership internally? What's the use-case for this? Containers?

Aside from optimizing this and the special semantics for this eventfs
stuff that you really should think twice of doing, here's one idea for
an extension that might alleviate some of the pain:

If you need flexible dynamic ownership change to e.g., be able to
delegate (all, a directory, a single file of) tracefs to
unprivileged/containers/whatever then you might want to consider
supporting idmapped mounts for tracefs. Because then you can do stuff
like:

user1@localhost:~/data/scripts$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 
u:0:1234:1' /run/ /mnt
user1@localhost:~/data/scripts$ ls -ln /run/
total 12
drwxr-xr-x  2 0  0   40 Jan  5 12:12 credentials
drwx--  2 0  0   40 Jan  5 11:57 cryptsetup
drwxr-xr-x  2 0  0   60 Jan  5 11:57 dbus
drwx--  6 0  0  280 Jan  5 11:57 incus_agent
prw---  1 0  00 Jan  5 11:57 initctl
drwxrwxrwt  4 0  0   80 Jan  5 11:57 lock
drwxr-xr-x  3 0  0   60 Jan  5 11:57 log
drwx--  2 0  0   40 Jan  5 11:57 lvm
-r--r--r--  1 0  0   33 Jan  5 11:57 machine-id
-rw-r--r--  1 0  0  101 Jan  5 11:58 motd.dynamic
drwxr-xr-x  2 0  0   40 Jan  5 11:57 mount
drwx--  2 0  0   40 Jan  5 11:57 multipath
drwxr-xr-x  2 0  0   40 Jan  5 11:57 sendsigs.omit.d
lrwxrwxrwx  1 0  08 Jan  5 11:57 shm -> /dev/shm
drwx--x--x  2 0  0   40 Jan  5 11:57 sudo
drwxr-xr-x 24 0  0  660 Jan  5 14:30 systemd
drwxr-xr-x  6 0  0  140 Jan  5 14:30 udev
drwxr-xr-x  4 0  0   80 Jan  5 11:58 user
-rw-rw-r--  1 0 43 2304 Jan  5 15:15 utmp

user1@localhost:~/data/scripts$ ls -ln /mnt/
total 12
drwxr-xr-x  2 1234  1000   40 Jan  5 12:12 credentials
drwx--  2 1234  1000   40 Jan  5 11:57 cryptsetup
drwxr-xr-x  2 1234  1000   60 Jan  5 11:57 dbus
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 incus_agent
prw---  1 1234  10000 Jan  5 11:57 initctl
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 lock
drwxr-xr-x  3 1234  1000   60 Jan  5 11:57 log
drwx--  2 1234  1000   40 Jan  5 11:57 lvm
-r--r--r--  1 1234  1000   33 Jan  5 11:57 machine-id
-rw-r--r--  1 1234  1000  101 Jan  5 11:58 motd.dynamic
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 mount
drwx--  2 1234  1000   40 Jan  5 11:57 multipath
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 sendsigs.omit.d
lrwxrwxrwx  1 1234  10008 Jan  5 11:57 shm -> /dev/shm
drwx--x--x  2 1234  1000   40 Jan  5 11:57 sudo
drwxr-xr-x 24 1234  1000  660 Jan  5 14:30 systemd
drwxr-xr-x  6 1234  1000  140 Jan  5 14:30 udev
drwxr-xr-x  4 1234  1000   80 Jan  5 11:58 user
-rw-rw-r--  1 1234 65534 2304 Jan  5 15:15 utmp

Where you can see that ownership of this tmpfs instance in this example
is changed. I'm not trying to advocate here but this will probably
ultimately be nicer for your users because it means that a container
manager or whatever can be handed a part of tracefs (or all of it) and
the own

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Al Viro

On Wed, Jan 03, 2024 at 09:25:06PM -0500, Steven Rostedt wrote:
> On Thu, 4 Jan 2024 01:48:37 +
> Al Viro  wrote:
> 
> > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> > 
> > > + /* Get the tracefs root from the parent */
> > > + inode = d_inode(dentry->d_parent);
> > > + inode = d_inode(inode->i_sb->s_root);  
> > 
> > That makes no sense.  First of all, for any positive dentry we have
> > dentry->d_sb == dentry->d_inode->i_sb.  And it's the same for all
> > dentries on given superblock.  So what's the point of that dance?
> > If you want the root inode, just go for d_inode(dentry->d_sb->s_root)
> > and be done with that...
> 
> That was more of thinking that the dentry and dentry->d_parent are
> different. As dentry is part of eventfs and dentry->d_parent is part of
> tracefs.

???

>Currently they both have the same superblock so yeah, I could just
> write it that way too and it would work. But in my head, I was thinking
> that they behave differently and maybe one day eventfs would get its own
> superblock which would not work.

->d_parent *always* points to the same filesystem; if you get an (automounted?)
mountpoint there, ->d_parent simply won't work - it will point to dentry itself.

> To explain this better:
> 
>   /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events
> 
> But everything but "events" in /sys/kernel/tracing/* is part of tracefs.
> Everything in /sys/kernel/tracing/events is part of eventfs.
> 
> That was my thought process. But as both tracefs and eventfs still use
> tracefs_get_inode(), it would work as you state.
> 
> I'll update that, as I don't foresee that eventfs will become its own file
> system.

There is no way to get to underlying mountpoint by dentry - simply because
the same fs instance can be mounted in any number of places.

A crude overview of taxonomy:

file_system_type: what filesystem instances belong to.  Not quite the same
thing as fs driver (one driver can provide several of those).  Usually
it's 1-to-1, but that's not required (e.g. NFS vs NFSv4, or ext[234], or...).

super_block: individual filesystem instance.  Hosts dentry tree (connected or
several disconnected parts - think NFSv4 or the state while trying to get
a dentry by fhandle, etc.).

dentry: object in a filesystem's directory tree(s).  Always belongs to
specific filesystem instance - that relationship never changes.  Tree
structure (and names) _within_ _filesystem_ belong on that level.
->d_parent is part of that tree structure; never NULL, root of a (sub)tree
has it pointing to itself.  Might be negative, might refer to a filesystem 
object
(file, directory, symlink, etc.).

inode: filesystem object (file, directory, etc.).  Always belongs to
specific filesystem instance.  Non-directory inodes might have any
number of dentry instances (aliases) refering to it; a directory one - no 
more than one.  Filesystem object contents belongs here; multiple hardlinks
have different dentries and the same inode.  Of course, filesystem type in
question might have no such thing as multiple hardlinks - that's up to
filesystem.  In general there is no way to find (or enumerate) such links;
e.g. a local filesystem might have an extra hardlink somewhere we had
never looked at and there won't be any dentries for such hardlinks and
no way to get them short of doing the full search of the entire tree.
The situation with e.g. NFS client is even worse, obviously.

mount: in a sense, mount to super_block is what dentry is to inode.  It
provides a view of (sub)tree hosted in given filesystem instance.  The
same filesystem may have any number of mounts, refering to its subtrees
(possibly the same subtree for each, possibly all different - up to
the callers of mount(2)).  They form mount tree(s) - that's where the
notions related to "this mounted on top of that" belong.  Note that
they can be moved around - with no telling the filesystem about that
happening.  Again, there's no such thing as "the mountpoint of given
filesystem instance" - it might be mounted in any number of places
at the same time.  Specific mount - sure, no problem, except that it
can move around.

namespace: mount tree.  Unlike everything prior, this one is a part of
process state - same as descriptor table, mappings, etc.

file: opened IO channel.  It does refer to specific mount and specific
dentry (and thus filesystem instance and an inode on it).  Current
IO position lives here, so does any per-open(2) state.

descriptor table: mapping from numbers to IO channels (opened files).
Again, a part of process state.  dup() creates a new entry, with
reference to the same file as the old one; multiple open() of the
same pathname will each yield a separate opened file.  _Some_ state
belongs here (close-on-exec, m

[PATCH v2] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Steven Rostedt

try *parent, void *data,
   const struct file_operations *fops)
 {
+   struct tracefs_inode *ti;
struct dentry *dentry;
struct inode *inode;
 
@@ -621,7 +635,11 @@ struct dentry *tracefs_create_file(const char *name, 
umode_t mode,
if (unlikely(!inode))
return tracefs_failed_creating(dentry);
 
+   ti = get_tracefs(inode);
+   ti->private = instance_inode(parent, inode);
+
inode->i_mode = mode;
+   inode->i_op = _file_inode_operations;
inode->i_fop = fops ? fops : _file_operations;
inode->i_private = data;
inode->i_uid = d_inode(dentry->d_parent)->i_uid;
@@ -634,6 +652,7 @@ struct dentry *tracefs_create_file(const char *name, 
umode_t mode,
 static struct dentry *__create_dir(const char *name, struct dentry *parent,
   const struct inode_operations *ops)
 {
+   struct tracefs_inode *ti;
struct dentry *dentry = tracefs_start_creating(name, parent);
struct inode *inode;
 
@@ -651,6 +670,9 @@ static struct dentry *__create_dir(const char *name, struct 
dentry *parent,
inode->i_uid = d_inode(dentry->d_parent)->i_uid;
inode->i_gid = d_inode(dentry->d_parent)->i_gid;
 
+   ti = get_tracefs(inode);
+   ti->private = instance_inode(parent, inode);
+
/* directory inodes start off with i_nlink == 2 (for "." entry) */
inc_nlink(inode);
d_instantiate(dentry, inode);
@@ -681,7 +703,7 @@ struct dentry *tracefs_create_dir(const char *name, struct 
dentry *parent)
if (security_locked_down(LOCKDOWN_TRACEFS))
return NULL;
 
-   return __create_dir(name, parent, _dir_inode_operations);
+   return __create_dir(name, parent, _dir_inode_operations);
 }
 
 /**
@@ -712,7 +734,7 @@ __init struct dentry *tracefs_create_instance_dir(const 
char *name,
if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir))
return NULL;
 
-   dentry = __create_dir(name, parent, _dir_inode_operations);
+   dentry = __create_dir(name, parent, 
_instance_dir_inode_operations);
if (!dentry)
return NULL;
 
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 42bdeb471a07..12b7d0150ae9 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -5,6 +5,9 @@
 enum {
TRACEFS_EVENT_INODE = BIT(1),
TRACEFS_EVENT_TOP_INODE = BIT(2),
+   TRACEFS_GID_PERM_SET= BIT(3),
+   TRACEFS_UID_PERM_SET= BIT(4),
+   TRACEFS_INSTANCE_INODE  = BIT(5),
 };
 
 struct tracefs_inode {
-- 
2.42.0

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Steven Rostedt

On Thu, 4 Jan 2024 01:48:37 +
Al Viro  wrote:

> On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> 
> > +   /* Get the tracefs root from the parent */
> > +   inode = d_inode(dentry->d_parent);
> > +   inode = d_inode(inode->i_sb->s_root);  
> 
> That makes no sense.  First of all, for any positive dentry we have
> dentry->d_sb == dentry->d_inode->i_sb.  And it's the same for all
> dentries on given superblock.  So what's the point of that dance?
> If you want the root inode, just go for d_inode(dentry->d_sb->s_root)
> and be done with that...

That was more of thinking that the dentry and dentry->d_parent are
different. As dentry is part of eventfs and dentry->d_parent is part of
tracefs. Currently they both have the same superblock so yeah, I could just
write it that way too and it would work. But in my head, I was thinking
that they behave differently and maybe one day eventfs would get its own
superblock which would not work.

To explain this better:

  /sys/kernel/tracing/ is the parent of /sys/kernel/tracing/events

But everything but "events" in /sys/kernel/tracing/* is part of tracefs.
Everything in /sys/kernel/tracing/events is part of eventfs.

That was my thought process. But as both tracefs and eventfs still use
tracefs_get_inode(), it would work as you state.

I'll update that, as I don't foresee that eventfs will become its own file
system.

Thanks,

-- Steve

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Steven Rostedt

On Thu, 4 Jan 2024 01:59:10 +
Al Viro  wrote:

> On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> 
> > +static struct inode *instance_inode(struct dentry *parent, struct inode 
> > *inode)
> > +{
> > +   struct tracefs_inode *ti;
> > +   struct inode *root_inode;
> > +
> > +   root_inode = d_inode(inode->i_sb->s_root);
> > +
> > +   /* If parent is NULL then use root inode */
> > +   if (!parent)
> > +   return root_inode;
> > +
> > +   /* Find the inode that is flagged as an instance or the root inode */
> > +   do {
> > +   inode = d_inode(parent);
> > +   if (inode == root_inode)
> > +   return root_inode;
> > +
> > +   ti = get_tracefs(inode);
> > +
> > +   if (ti->flags & TRACEFS_INSTANCE_INODE)
> > +   return inode;
> > +   } while ((parent = parent->d_parent));  
> 
> *blink*
> 
> This is equivalent to
>   ...
>   parent = parent->d_parent;
>   } while (true);

Yeah, that loop went through a few iterations as I first thought that root
was a tracefs_inode and the get_tracefs() would work on it. No, it was not,
and it caused a cash. But I didn't rewrite the loop well after fixing that.

I was also not sure if parent could be NULL, and wanted to protect against it.

> 
> ->d_parent is *never* NULL.  And what the hell is that loop supposed to do,  
> anyway?  Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE?
> 
> If root is not marked that way, I would suggest
>   if (!parent)
>   parent = inode->i_sb->s_root;
>   while (!IS_ROOT(parent)) {
>   struct tracefs_inode *ti = get_tracefs(parent->d_inode);
>   if (ti->flags & TRACEFS_INSTANCE_INODE)
>   break;
>   parent = parent->d_parent;
>   }
>   return parent->d_inode;

Sure, I could rewrite it that way too.

Thanks,

-- Steve

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Al Viro

On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:

> +static struct inode *instance_inode(struct dentry *parent, struct inode 
> *inode)
> +{
> + struct tracefs_inode *ti;
> + struct inode *root_inode;
> +
> + root_inode = d_inode(inode->i_sb->s_root);
> +
> + /* If parent is NULL then use root inode */
> + if (!parent)
> + return root_inode;
> +
> + /* Find the inode that is flagged as an instance or the root inode */
> + do {
> + inode = d_inode(parent);
> + if (inode == root_inode)
> + return root_inode;
> +
> + ti = get_tracefs(inode);
> +
> + if (ti->flags & TRACEFS_INSTANCE_INODE)
> + return inode;
> + } while ((parent = parent->d_parent));

*blink*

This is equivalent to
...
parent = parent->d_parent;
} while (true);

->d_parent is *never* NULL.  And what the hell is that loop supposed to do,
anyway?  Find the nearest ancestor tagged with TRACEFS_INSTANCE_INODE?

If root is not marked that way, I would suggest
if (!parent)
parent = inode->i_sb->s_root;
while (!IS_ROOT(parent)) {
struct tracefs_inode *ti = get_tracefs(parent->d_inode);
if (ti->flags & TRACEFS_INSTANCE_INODE)
break;
parent = parent->d_parent;
}
return parent->d_inode;

Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Al Viro

On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:

> + /* Get the tracefs root from the parent */
> + inode = d_inode(dentry->d_parent);
> + inode = d_inode(inode->i_sb->s_root);

That makes no sense.  First of all, for any positive dentry we have
dentry->d_sb == dentry->d_inode->i_sb.  And it's the same for all
dentries on given superblock.  So what's the point of that dance?
If you want the root inode, just go for d_inode(dentry->d_sb->s_root)
and be done with that...

[PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-03 Thread Steven Rostedt

  const struct file_operations *fops)
 {
+   struct tracefs_inode *ti;
struct dentry *dentry;
struct inode *inode;
 
@@ -621,7 +642,11 @@ struct dentry *tracefs_create_file(const char *name, 
umode_t mode,
if (unlikely(!inode))
return tracefs_failed_creating(dentry);
 
+   ti = get_tracefs(inode);
+   ti->private = instance_inode(parent, inode);
+
inode->i_mode = mode;
+   inode->i_op = _file_inode_operations;
inode->i_fop = fops ? fops : _file_operations;
inode->i_private = data;
inode->i_uid = d_inode(dentry->d_parent)->i_uid;
@@ -634,6 +659,7 @@ struct dentry *tracefs_create_file(const char *name, 
umode_t mode,
 static struct dentry *__create_dir(const char *name, struct dentry *parent,
   const struct inode_operations *ops)
 {
+   struct tracefs_inode *ti;
struct dentry *dentry = tracefs_start_creating(name, parent);
struct inode *inode;
 
@@ -651,6 +677,9 @@ static struct dentry *__create_dir(const char *name, struct 
dentry *parent,
inode->i_uid = d_inode(dentry->d_parent)->i_uid;
inode->i_gid = d_inode(dentry->d_parent)->i_gid;
 
+   ti = get_tracefs(inode);
+   ti->private = instance_inode(parent, inode);
+
/* directory inodes start off with i_nlink == 2 (for "." entry) */
inc_nlink(inode);
d_instantiate(dentry, inode);
@@ -681,7 +710,7 @@ struct dentry *tracefs_create_dir(const char *name, struct 
dentry *parent)
if (security_locked_down(LOCKDOWN_TRACEFS))
return NULL;
 
-   return __create_dir(name, parent, _dir_inode_operations);
+   return __create_dir(name, parent, _dir_inode_operations);
 }
 
 /**
@@ -712,7 +741,7 @@ __init struct dentry *tracefs_create_instance_dir(const 
char *name,
if (WARN_ON(tracefs_ops.mkdir || tracefs_ops.rmdir))
return NULL;
 
-   dentry = __create_dir(name, parent, _dir_inode_operations);
+   dentry = __create_dir(name, parent, 
_instance_dir_inode_operations);
if (!dentry)
return NULL;
 
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 42bdeb471a07..12b7d0150ae9 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -5,6 +5,9 @@
 enum {
TRACEFS_EVENT_INODE = BIT(1),
TRACEFS_EVENT_TOP_INODE = BIT(2),
+   TRACEFS_GID_PERM_SET= BIT(3),
+   TRACEFS_UID_PERM_SET= BIT(4),
+   TRACEFS_INSTANCE_INODE  = BIT(5),
 };
 
 struct tracefs_inode {
-- 
2.42.0

Re: [PATCH] dax: make sure inodes are flushed before destroy cache

2022-02-14 Thread Ira Weiny

On Mon, Feb 14, 2022 at 12:09:54PM -0800, Dan Williams wrote:
> On Mon, Feb 14, 2022 at 9:59 AM Ira Weiny  wrote:
> >
> > On Fri, Feb 11, 2022 at 11:11:11PM -0800, Tong Zhang wrote:
> > > A bug can be triggered by following command
> > >
> > > $ modprobe nd_pmem && modprobe -r nd_pmem
> > >
> > > [   10.060014] BUG dax_cache (Not tainted): Objects remaining in 
> > > dax_cache on __kmem_cache_shutdown()
> > > [   10.060938] Slab 0x85b729ac objects=9 used=1 
> > > fp=0x4f5ae469 flags=0x2010200(slab|head|node)
> > > [   10.062433] Call Trace:
> > > [   10.062673]  dump_stack_lvl+0x34/0x44
> > > [   10.062865]  slab_err+0x90/0xd0
> > > [   10.063619]  __kmem_cache_shutdown+0x13b/0x2f0
> > > [   10.063848]  kmem_cache_destroy+0x4a/0x110
> > > [   10.064058]  __x64_sys_delete_module+0x265/0x300
> > >
> > > This is caused by dax_fs_exit() not flushing inodes before destroy cache.
> > > To fix this issue, call rcu_barrier() before destroy cache.
> >
> > I don't doubt that this fixes the bug.  However, I can't help but think 
> > this is
> > hiding a bug, or perhaps a missing step, in the kmem_cache layer?  As far 
> > as I
> > can see dax does not call call_rcu() and only uses srcu not rcu?  I was 
> > tempted
> > to suggest srcu_barrier() but dax does not call call_srcu() either.
> 
> This rcu_barrier() is associated with the call_rcu() in destroy_inode().

Ok yea.

> 
> While kern_unmount() does a full sycnrhonize_rcu() after clearing
> ->mnt_ns. Any pending destroy_inode() callbacks need to be flushed
> before the kmem_cache is destroyed.
> 
> > So I'm not clear about what is really going on and why this fixes it.  I 
> > know
> > that dax is not using srcu is a standard way so perhaps this helps in a way 
> > I
> > don't quite grok?  If so perhaps a comment here would be in order?
> 
> Looks like a common pattern I missed that all filesystem exit paths implement.

I think a comment would be in order, especially since since it looks like every
other FS has one:

fs/ext4/super.c:

...
/*
 * Make sure all delayed rcu free inodes are flushed before we
 * destroy cache.
 */
rcu_barrier();  
...

Anyway ok.

Reviewed-by: Ira Weiny 

Thanks for looking Dan,
Ira

Re: [PATCH] dax: make sure inodes are flushed before destroy cache

2022-02-14 Thread Dan Williams

On Mon, Feb 14, 2022 at 9:59 AM Ira Weiny  wrote:
>
> On Fri, Feb 11, 2022 at 11:11:11PM -0800, Tong Zhang wrote:
> > A bug can be triggered by following command
> >
> > $ modprobe nd_pmem && modprobe -r nd_pmem
> >
> > [   10.060014] BUG dax_cache (Not tainted): Objects remaining in dax_cache 
> > on __kmem_cache_shutdown()
> > [   10.060938] Slab 0x85b729ac objects=9 used=1 
> > fp=0x4f5ae469 flags=0x2010200(slab|head|node)
> > [   10.062433] Call Trace:
> > [   10.062673]  dump_stack_lvl+0x34/0x44
> > [   10.062865]  slab_err+0x90/0xd0
> > [   10.063619]  __kmem_cache_shutdown+0x13b/0x2f0
> > [   10.063848]  kmem_cache_destroy+0x4a/0x110
> > [   10.064058]  __x64_sys_delete_module+0x265/0x300
> >
> > This is caused by dax_fs_exit() not flushing inodes before destroy cache.
> > To fix this issue, call rcu_barrier() before destroy cache.
>
> I don't doubt that this fixes the bug.  However, I can't help but think this 
> is
> hiding a bug, or perhaps a missing step, in the kmem_cache layer?  As far as I
> can see dax does not call call_rcu() and only uses srcu not rcu?  I was 
> tempted
> to suggest srcu_barrier() but dax does not call call_srcu() either.

This rcu_barrier() is associated with the call_rcu() in destroy_inode().

While kern_unmount() does a full sycnrhonize_rcu() after clearing
->mnt_ns. Any pending destroy_inode() callbacks need to be flushed
before the kmem_cache is destroyed.

> So I'm not clear about what is really going on and why this fixes it.  I know
> that dax is not using srcu is a standard way so perhaps this helps in a way I
> don't quite grok?  If so perhaps a comment here would be in order?

Looks like a common pattern I missed that all filesystem exit paths implement.

Re: [PATCH] dax: make sure inodes are flushed before destroy cache

2022-02-14 Thread Ira Weiny

On Fri, Feb 11, 2022 at 11:11:11PM -0800, Tong Zhang wrote:
> A bug can be triggered by following command
> 
> $ modprobe nd_pmem && modprobe -r nd_pmem
> 
> [   10.060014] BUG dax_cache (Not tainted): Objects remaining in dax_cache on 
> __kmem_cache_shutdown()
> [   10.060938] Slab 0x85b729ac objects=9 used=1 fp=0x4f5ae469 
> flags=0x2010200(slab|head|node)
> [   10.062433] Call Trace:
> [   10.062673]  dump_stack_lvl+0x34/0x44
> [   10.062865]  slab_err+0x90/0xd0
> [   10.063619]  __kmem_cache_shutdown+0x13b/0x2f0
> [   10.063848]  kmem_cache_destroy+0x4a/0x110
> [   10.064058]  __x64_sys_delete_module+0x265/0x300
> 
> This is caused by dax_fs_exit() not flushing inodes before destroy cache.
> To fix this issue, call rcu_barrier() before destroy cache.

I don't doubt that this fixes the bug.  However, I can't help but think this is
hiding a bug, or perhaps a missing step, in the kmem_cache layer?  As far as I
can see dax does not call call_rcu() and only uses srcu not rcu?  I was tempted
to suggest srcu_barrier() but dax does not call call_srcu() either.

So I'm not clear about what is really going on and why this fixes it.  I know
that dax is not using srcu is a standard way so perhaps this helps in a way I
don't quite grok?  If so perhaps a comment here would be in order?

Ira

> 
> Signed-off-by: Tong Zhang 
> ---
>  drivers/dax/super.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index e3029389d809..6bd565fe2e63 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -476,6 +476,7 @@ static int dax_fs_init(void)
>  static void dax_fs_exit(void)
>  {
>   kern_unmount(dax_mnt);
> + rcu_barrier();
>   kmem_cache_destroy(dax_cache);
>  }
>  
> -- 
> 2.25.1
> 
>

Re: [PATCH] dax: make sure inodes are flushed before destroy cache

2022-02-14 Thread Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig

[PATCH] dax: make sure inodes are flushed before destroy cache

2022-02-11 Thread Tong Zhang

A bug can be triggered by following command

$ modprobe nd_pmem && modprobe -r nd_pmem

[   10.060014] BUG dax_cache (Not tainted): Objects remaining in dax_cache on 
__kmem_cache_shutdown()
[   10.060938] Slab 0x85b729ac objects=9 used=1 fp=0x4f5ae469 
flags=0x2010200(slab|head|node)
[   10.062433] Call Trace:
[   10.062673]  dump_stack_lvl+0x34/0x44
[   10.062865]  slab_err+0x90/0xd0
[   10.063619]  __kmem_cache_shutdown+0x13b/0x2f0
[   10.063848]  kmem_cache_destroy+0x4a/0x110
[   10.064058]  __x64_sys_delete_module+0x265/0x300

This is caused by dax_fs_exit() not flushing inodes before destroy cache.
To fix this issue, call rcu_barrier() before destroy cache.

Signed-off-by: Tong Zhang 
---
 drivers/dax/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e3029389d809..6bd565fe2e63 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -476,6 +476,7 @@ static int dax_fs_init(void)
 static void dax_fs_exit(void)
 {
kern_unmount(dax_mnt);
+   rcu_barrier();
kmem_cache_destroy(dax_cache);
 }
 
-- 
2.25.1

Re: [PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t

2021-02-05 Thread Seth Forshee

On Fri, Feb 05, 2021 at 01:23:13PM -0800, Andrew Morton wrote:
> On Fri, 5 Feb 2021 14:55:43 -0600 Seth Forshee  
> wrote:
> 
> > On Fri, Feb 05, 2021 at 12:41:57PM -0800, Andrew Morton wrote:
> > > On Fri,  5 Feb 2021 14:21:59 -0600 Seth Forshee 
> > >  wrote:
> > > 
> > > > Currently there seems to be an assumption in tmpfs that 64-bit
> > > > architectures also have a 64-bit ino_t. This is not true; s390 at
> > > > least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs
> > > > mounts will get 64-bit inode numbers and display "inode64" in the
> > > > mount options, but passing the "inode64" mount option will fail.
> > > > This leads to the following behavior:
> > > > 
> > > >  # mkdir mnt
> > > >  # mount -t tmpfs nodev mnt
> > > >  # mount -o remount,rw mnt
> > > >  mount: /home/ubuntu/mnt: mount point not mounted or bad option.
> > > > 
> > > > As mount sees "inode64" in the mount options and thus passes it
> > > > in the options for the remount.
> > > > 
> > > > Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8,
> > > > but I don't think it's possible to test for this (potentially
> > > > CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm
> > > > not sure whether or not that is wanted). So fix this by simply
> > > > refusing to honor the CONFIG_TMPFS_INODE64 setting when
> > > > sizeof(ino_t) < 8.
> > > 
> > > How about changing s390 Kconfig so that CONFIG_TMPFS_INODE64 is not 
> > > enabled?
> > 
> > I did do that for our config. I see the s390 defconfig has it enabled,
> > so I will send a patch for that too. But the fact that it can be
> > configured that way and that the code behaves badly still seems
> > problematic.
> 
> I meant
> 
> --- a/fs/Kconfig~a
> +++ a/fs/Kconfig
> @@ -203,7 +203,7 @@ config TMPFS_XATTR
>  
>  config TMPFS_INODE64
>   bool "Use 64-bit ino_t by default in tmpfs"
> - depends on TMPFS && 64BIT
> + depends on TMPFS && 64BIT && !S390
>   default n
>   help
> tmpfs has historically used only inode numbers as wide as an unsigned
> _

Ah. s390 does appear to be the only architecture that doesn't use
unsigned long for ino_t, so that seems good to me. I can send a patch if
you want, but seems like you've already got one ...

Seth

Re: [PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t

2021-02-05 Thread Andrew Morton

On Fri, 5 Feb 2021 14:55:43 -0600 Seth Forshee  
wrote:

> On Fri, Feb 05, 2021 at 12:41:57PM -0800, Andrew Morton wrote:
> > On Fri,  5 Feb 2021 14:21:59 -0600 Seth Forshee 
> >  wrote:
> > 
> > > Currently there seems to be an assumption in tmpfs that 64-bit
> > > architectures also have a 64-bit ino_t. This is not true; s390 at
> > > least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs
> > > mounts will get 64-bit inode numbers and display "inode64" in the
> > > mount options, but passing the "inode64" mount option will fail.
> > > This leads to the following behavior:
> > > 
> > >  # mkdir mnt
> > >  # mount -t tmpfs nodev mnt
> > >  # mount -o remount,rw mnt
> > >  mount: /home/ubuntu/mnt: mount point not mounted or bad option.
> > > 
> > > As mount sees "inode64" in the mount options and thus passes it
> > > in the options for the remount.
> > > 
> > > Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8,
> > > but I don't think it's possible to test for this (potentially
> > > CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm
> > > not sure whether or not that is wanted). So fix this by simply
> > > refusing to honor the CONFIG_TMPFS_INODE64 setting when
> > > sizeof(ino_t) < 8.
> > 
> > How about changing s390 Kconfig so that CONFIG_TMPFS_INODE64 is not enabled?
> 
> I did do that for our config. I see the s390 defconfig has it enabled,
> so I will send a patch for that too. But the fact that it can be
> configured that way and that the code behaves badly still seems
> problematic.

I meant

--- a/fs/Kconfig~a
+++ a/fs/Kconfig
@@ -203,7 +203,7 @@ config TMPFS_XATTR
 
 config TMPFS_INODE64
bool "Use 64-bit ino_t by default in tmpfs"
-   depends on TMPFS && 64BIT
+   depends on TMPFS && 64BIT && !S390
default n
help
  tmpfs has historically used only inode numbers as wide as an unsigned
_

Re: [PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t

2021-02-05 Thread Seth Forshee

On Fri, Feb 05, 2021 at 12:41:57PM -0800, Andrew Morton wrote:
> On Fri,  5 Feb 2021 14:21:59 -0600 Seth Forshee  
> wrote:
> 
> > Currently there seems to be an assumption in tmpfs that 64-bit
> > architectures also have a 64-bit ino_t. This is not true; s390 at
> > least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs
> > mounts will get 64-bit inode numbers and display "inode64" in the
> > mount options, but passing the "inode64" mount option will fail.
> > This leads to the following behavior:
> > 
> >  # mkdir mnt
> >  # mount -t tmpfs nodev mnt
> >  # mount -o remount,rw mnt
> >  mount: /home/ubuntu/mnt: mount point not mounted or bad option.
> > 
> > As mount sees "inode64" in the mount options and thus passes it
> > in the options for the remount.
> > 
> > Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8,
> > but I don't think it's possible to test for this (potentially
> > CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm
> > not sure whether or not that is wanted). So fix this by simply
> > refusing to honor the CONFIG_TMPFS_INODE64 setting when
> > sizeof(ino_t) < 8.
> 
> How about changing s390 Kconfig so that CONFIG_TMPFS_INODE64 is not enabled?

I did do that for our config. I see the s390 defconfig has it enabled,
so I will send a patch for that too. But the fact that it can be
configured that way and that the code behaves badly still seems
problematic.

> 
> > Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb")
> > Signed-off-by: Seth Forshee 
> 
> With a cc:stable, I assume?

Yes, I forgot that. Thanks.

Seth

Re: [PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t

2021-02-05 Thread Andrew Morton

On Fri,  5 Feb 2021 14:21:59 -0600 Seth Forshee  
wrote:

> Currently there seems to be an assumption in tmpfs that 64-bit
> architectures also have a 64-bit ino_t. This is not true; s390 at
> least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs
> mounts will get 64-bit inode numbers and display "inode64" in the
> mount options, but passing the "inode64" mount option will fail.
> This leads to the following behavior:
> 
>  # mkdir mnt
>  # mount -t tmpfs nodev mnt
>  # mount -o remount,rw mnt
>  mount: /home/ubuntu/mnt: mount point not mounted or bad option.
> 
> As mount sees "inode64" in the mount options and thus passes it
> in the options for the remount.
> 
> Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8,
> but I don't think it's possible to test for this (potentially
> CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm
> not sure whether or not that is wanted). So fix this by simply
> refusing to honor the CONFIG_TMPFS_INODE64 setting when
> sizeof(ino_t) < 8.

How about changing s390 Kconfig so that CONFIG_TMPFS_INODE64 is not enabled?

> Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb")
> Signed-off-by: Seth Forshee 

With a cc:stable, I assume?

[PATCH] tmpfs: Don't use 64-bit inodes by defulat with 32-bit ino_t

2021-02-05 Thread Seth Forshee

Currently there seems to be an assumption in tmpfs that 64-bit
architectures also have a 64-bit ino_t. This is not true; s390 at
least has a 32-bit ino_t. With CONFIG_TMPFS_INODE64=y tmpfs
mounts will get 64-bit inode numbers and display "inode64" in the
mount options, but passing the "inode64" mount option will fail.
This leads to the following behavior:

 # mkdir mnt
 # mount -t tmpfs nodev mnt
 # mount -o remount,rw mnt
 mount: /home/ubuntu/mnt: mount point not mounted or bad option.

As mount sees "inode64" in the mount options and thus passes it
in the options for the remount.

Ideally CONFIG_TMPFS_INODE64 would depend on sizeof(ino_t) < 8,
but I don't think it's possible to test for this (potentially
CONFIG_ARCH_HAS_64BIT_INO_T or similar could be added, but I'm
not sure whether or not that is wanted). So fix this by simply
refusing to honor the CONFIG_TMPFS_INODE64 setting when
sizeof(ino_t) < 8.

Fixes: ea3271f7196c ("tmpfs: support 64-bit inums per-sb")
Signed-off-by: Seth Forshee 
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 7c6b6d8f6c39..efde42acdc7a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3733,7 +3733,7 @@ static int shmem_fill_super(struct super_block *sb, 
struct fs_context *fc)
ctx->blocks = shmem_default_max_blocks();
if (!(ctx->seen & SHMEM_SEEN_INODES))
ctx->inodes = shmem_default_max_inodes();
-   if (!(ctx->seen & SHMEM_SEEN_INUMS))
+   if (!(ctx->seen & SHMEM_SEEN_INUMS) && sizeof(ino_t) >= 8)
ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
} else {
sb->s_flags |= SB_NOUSER;
-- 
2.29.2

Re: [PATCH] ecryptfs: Fix inodes not being evicted until unmount

2021-02-02 Thread Jeffrey Mitchell

On Sat, Jan 30, 2021 at 11:06:40AM -0600, Tyler Hicks wrote:
> Hey Jeffrey and Dan - thanks for the patch! Unfortunately, I think this
> would allow the eCryptfs inode's nlink count to get out of sync with the
> lower inode's nlink count in the case of direct manipulation to the
> lower filesystem.

Hmm. What if I instead synchronize it before calling vfs_unlink(), then
call drop_nlink() if vfs_unlink() succeeds?

> Is the condition that you're trying to fix a result of going through the
> this code path?
> 
>  ecryptfs_unlink() -> ecryptfs_do_unlink() -> vfs_unlink() -> nfs_unlink() -> 
> nfs_sillyrename() -> nfs_async_unlink()

Yes, that is the code path that causes it.

V/R,
Jeffrey Mitchell

Re: [PATCH] ecryptfs: Fix inodes not being evicted until unmount

2021-01-30 Thread Tyler Hicks

On 2020-12-18 13:07:30, Jeffrey Mitchell wrote:
> On asynchronous base filesystems like NFS, eCryptFS leaves inodes for
> deleted files in the cache until unmounting. Change call in
> ecryptfs_do_unlink() from set_nlink() to drop_nlink() in order to reliably
> evict inodes from the cache even on top of NFS.
> 
> Signed-off-by: Dan Robertson 
> Signed-off-by: Jeffrey Mitchell 

Hey Jeffrey and Dan - thanks for the patch! Unfortunately, I think this
would allow the eCryptfs inode's nlink count to get out of sync with the
lower inode's nlink count in the case of direct manipulation to the
lower filesystem.

Is the condition that you're trying to fix a result of going through the
this code path?

 ecryptfs_unlink() -> ecryptfs_do_unlink() -> vfs_unlink() -> nfs_unlink() -> 
nfs_sillyrename() -> nfs_async_unlink()

Tyler

> ---
>  fs/ecryptfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
> index e23752d..f7594b6 100644
> --- a/fs/ecryptfs/inode.c
> +++ b/fs/ecryptfs/inode.c
> @@ -147,7 +147,7 @@ static int ecryptfs_do_unlink(struct inode *dir, struct 
> dentry *dentry,
>   goto out_unlock;
>   }
>   fsstack_copy_attr_times(dir, lower_dir_inode);
> - set_nlink(inode, ecryptfs_inode_to_lower(inode)->i_nlink);
> + drop_nlink(inode);
>   inode->i_ctime = dir->i_ctime;
>  out_unlock:
>   dput(lower_dentry);
> -- 
> 2.7.4
>

[PATCH 5.10 012/103] btrfs: shrink delalloc pages instead of full inodes

2021-01-15 Thread Greg Kroah-Hartman

From: Josef Bacik 

[ Upstream commit e076ab2a2ca70a0270232067cd49f76cd92efe64 ]

Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
some infrastructure we have in place to flush inodes that we use for
device replace and snapshot.  However this introduced a pretty serious
performance regression.  To reproduce the user untarred the source
tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
see it take anywhere from 5 to 20 times as long to untar in 5.10
compared to 5.9. This was observed on fast devices (SSD and better) and
not on HDD.

The root cause is because before we would generally use the normal
writeback path to reclaim delalloc space, and for this we would provide
it with the number of pages we wanted to flush.  The referenced commit
changed this to flush that many inodes, which drastically increased the
amount of space we were flushing in certain cases, which severely
affected performance.

We cannot revert this patch unfortunately because of 3d45f221ce62
("btrfs: fix deadlock when cloning inline extent and low on free
metadata space") which requires the ability to skip flushing inodes that
are being cloned in certain scenarios, which means we need to keep using
our flushing infrastructure or risk re-introducing the deadlock.

Instead to fix this problem we can go back to providing
btrfs_start_delalloc_roots with a number of pages to flush, and then set
up a writeback_control and utilize sync_inode() to handle the flushing
for us.  This gives us the same behavior we had prior to the fix, while
still allowing us to avoid the deadlock that was fixed by Filipe.  I
redid the users original test and got the following results on one of
our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive)

  5.9   0m54.258s
  5.10  1m26.212s
  5.10+patch0m38.800s

5.10+patch is significantly faster than plain 5.9 because of my patch
series "Change data reservations to use the ticketing infra" which
contained the patch that introduced the regression, but generally
improved the overall ENOSPC flushing mechanisms.

Additional testing on consumer-grade SSD (8GiB ram, 8 CPU) confirm
the results:

  5.10.54m00s
  5.10.5+patch  1m08s
  5.11-rc2  5m14s
  5.11-rc2+patch1m30s

Reported-by: René Rebe 
Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
CC: sta...@vger.kernel.org # 5.10
Signed-off-by: Josef Bacik 
Tested-by: David Sterba 
Reviewed-by: David Sterba 
[ add my test results ]
Signed-off-by: David Sterba 
Signed-off-by: Sasha Levin 
---
 fs/btrfs/inode.c  | 60 +++
 fs/btrfs/space-info.c |  4 ++-
 2 files changed, 46 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 07479250221da..acc47e2ffb46b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9389,7 +9389,8 @@ static struct btrfs_delalloc_work 
*btrfs_alloc_delalloc_work(struct inode *inode
  * some fairly slow code that needs optimization. This walks the list
  * of all the inodes with pending delalloc and forces them to disk.
  */
-static int start_delalloc_inodes(struct btrfs_root *root, u64 *nr, bool 
snapshot,
+static int start_delalloc_inodes(struct btrfs_root *root,
+struct writeback_control *wbc, bool snapshot,
 bool in_reclaim_context)
 {
struct btrfs_inode *binode;
@@ -9398,6 +9399,7 @@ static int start_delalloc_inodes(struct btrfs_root *root, 
u64 *nr, bool snapshot
struct list_head works;
struct list_head splice;
int ret = 0;
+   bool full_flush = wbc->nr_to_write == LONG_MAX;
 
INIT_LIST_HEAD();
INIT_LIST_HEAD();
@@ -9426,18 +9428,24 @@ static int start_delalloc_inodes(struct btrfs_root 
*root, u64 *nr, bool snapshot
if (snapshot)
set_bit(BTRFS_INODE_SNAPSHOT_FLUSH,
>runtime_flags);
-   work = btrfs_alloc_delalloc_work(inode);
-   if (!work) {
-   iput(inode);
-   ret = -ENOMEM;
-   goto out;
-   }
-   list_add_tail(>list, );
-   btrfs_queue_work(root->fs_info->flush_workers,
->work);
-   if (*nr != U64_MAX) {
-   (*nr)--;
-   if (*nr == 0)
+   if (full_flush) {
+   work = btrfs_alloc_delalloc_work(inode);
+   if (!work) {
+   iput(inode);
+   ret = -ENOMEM;
+   goto out;
+   }
+   list_add_tail(>list, );
+   btrfs_queue_work(root->fs_inf

Re: [PATCH v15 0/4] SELinux support for anonymous inodes and UFFD

2021-01-14 Thread Lokesh Gidra

On Thu, Jan 14, 2021 at 2:47 PM Paul Moore  wrote:
>
> On Tue, Jan 12, 2021 at 12:15 PM Paul Moore  wrote:
> >
> > On Fri, Jan 8, 2021 at 5:22 PM Lokesh Gidra  wrote:
> > >
> > > Userfaultfd in unprivileged contexts could be potentially very
> > > useful. We'd like to harden userfaultfd to make such unprivileged use
> > > less risky. This patch series allows SELinux to manage userfaultfd
> > > file descriptors and in the future, other kinds of
> > > anonymous-inode-based file descriptor.
> >
> > ...
> >
> > > Daniel Colascione (3):
> > >   fs: add LSM-supporting anon-inode interface
> > >   selinux: teach SELinux about anonymous inodes
> > >   userfaultfd: use secure anon inodes for userfaultfd
> > >
> > > Lokesh Gidra (1):
> > >   security: add inode_init_security_anon() LSM hook
> > >
> > >  fs/anon_inodes.c| 150 
> > >  fs/libfs.c  |   5 -
> > >  fs/userfaultfd.c|  19 ++--
> > >  include/linux/anon_inodes.h |   5 +
> > >  include/linux/lsm_hook_defs.h   |   2 +
> > >  include/linux/lsm_hooks.h   |   9 ++
> > >  include/linux/security.h|  10 ++
> > >  security/security.c |   8 ++
> > >  security/selinux/hooks.c|  57 +++
> > >  security/selinux/include/classmap.h |   2 +
> > >  10 files changed, 213 insertions(+), 54 deletions(-)
> >
> > With several rounds of reviews done and the corresponding SELinux test
> > suite looking close to being ready I think it makes sense to merge
> > this via the SELinux tree.  VFS folks, if you have any comments or
> > objections please let me know soon.  If I don't hear anything within
> > the next day or two I'll go ahead and merge this for linux-next.
>
> With no comments over the last two days I merged the patchset into
> selinux/next.  Thanks for all your work and patience on this Lokesh.
>
Thanks so much.

> Also, it looks like you are very close to getting the associated
> SELinux test suite additions merged, please continue to work with
> Ondrej to get those merged soon.
>
Certainly! I'm waiting for his reviews for the latest patch.

> --
> paul moore
> www.paul-moore.com

Re: [PATCH v15 0/4] SELinux support for anonymous inodes and UFFD

2021-01-14 Thread Paul Moore

On Tue, Jan 12, 2021 at 12:15 PM Paul Moore  wrote:
>
> On Fri, Jan 8, 2021 at 5:22 PM Lokesh Gidra  wrote:
> >
> > Userfaultfd in unprivileged contexts could be potentially very
> > useful. We'd like to harden userfaultfd to make such unprivileged use
> > less risky. This patch series allows SELinux to manage userfaultfd
> > file descriptors and in the future, other kinds of
> > anonymous-inode-based file descriptor.
>
> ...
>
> > Daniel Colascione (3):
> >   fs: add LSM-supporting anon-inode interface
> >   selinux: teach SELinux about anonymous inodes
> >   userfaultfd: use secure anon inodes for userfaultfd
> >
> > Lokesh Gidra (1):
> >   security: add inode_init_security_anon() LSM hook
> >
> >  fs/anon_inodes.c| 150 
> >  fs/libfs.c  |   5 -
> >  fs/userfaultfd.c|  19 ++--
> >  include/linux/anon_inodes.h |   5 +
> >  include/linux/lsm_hook_defs.h   |   2 +
> >  include/linux/lsm_hooks.h   |   9 ++
> >  include/linux/security.h|  10 ++
> >  security/security.c |   8 ++
> >  security/selinux/hooks.c|  57 +++
> >  security/selinux/include/classmap.h |   2 +
> >  10 files changed, 213 insertions(+), 54 deletions(-)
>
> With several rounds of reviews done and the corresponding SELinux test
> suite looking close to being ready I think it makes sense to merge
> this via the SELinux tree.  VFS folks, if you have any comments or
> objections please let me know soon.  If I don't hear anything within
> the next day or two I'll go ahead and merge this for linux-next.

With no comments over the last two days I merged the patchset into
selinux/next.  Thanks for all your work and patience on this Lokesh.

Also, it looks like you are very close to getting the associated
SELinux test suite additions merged, please continue to work with
Ondrej to get those merged soon.

-- 
paul moore
www.paul-moore.com

Re: [PATCH v15 0/4] SELinux support for anonymous inodes and UFFD

2021-01-12 Thread Paul Moore

On Fri, Jan 8, 2021 at 5:22 PM Lokesh Gidra  wrote:
>
> Userfaultfd in unprivileged contexts could be potentially very
> useful. We'd like to harden userfaultfd to make such unprivileged use
> less risky. This patch series allows SELinux to manage userfaultfd
> file descriptors and in the future, other kinds of
> anonymous-inode-based file descriptor.

...

> Daniel Colascione (3):
>   fs: add LSM-supporting anon-inode interface
>   selinux: teach SELinux about anonymous inodes
>   userfaultfd: use secure anon inodes for userfaultfd
>
> Lokesh Gidra (1):
>   security: add inode_init_security_anon() LSM hook
>
>  fs/anon_inodes.c| 150 
>  fs/libfs.c  |   5 -
>  fs/userfaultfd.c|  19 ++--
>  include/linux/anon_inodes.h |   5 +
>  include/linux/lsm_hook_defs.h   |   2 +
>  include/linux/lsm_hooks.h   |   9 ++
>  include/linux/security.h|  10 ++
>  security/security.c |   8 ++
>  security/selinux/hooks.c|  57 +++
>  security/selinux/include/classmap.h |   2 +
>  10 files changed, 213 insertions(+), 54 deletions(-)

With several rounds of reviews done and the corresponding SELinux test
suite looking close to being ready I think it makes sense to merge
this via the SELinux tree.  VFS folks, if you have any comments or
objections please let me know soon.  If I don't hear anything within
the next day or two I'll go ahead and merge this for linux-next.

Thanks.

-- 
paul moore
www.paul-moore.com

[PATCH v15 4/4] userfaultfd: use secure anon inodes for userfaultfd

2021-01-08 Thread Lokesh Gidra

From: Daniel Colascione 

This change gives userfaultfd file descriptors a real security
context, allowing policy to act on them.

Signed-off-by: Daniel Colascione 

[LG: Remove owner inode from userfaultfd_ctx]
[LG: Use anon_inode_getfd_secure() instead of anon_inode_getfile_secure()
 in userfaultfd syscall]
[LG: Use inode of file in userfaultfd_read() in resolve_userfault_fork()]

Signed-off-by: Lokesh Gidra 
Reviewed-by: Eric Biggers 
---
 fs/userfaultfd.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 894cc28142e7..0be8cdd4425a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -979,14 +979,14 @@ static __poll_t userfaultfd_poll(struct file *file, 
poll_table *wait)
 
 static const struct file_operations userfaultfd_fops;
 
-static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
- struct userfaultfd_ctx *new,
+static int resolve_userfault_fork(struct userfaultfd_ctx *new,
+ struct inode *inode,
  struct uffd_msg *msg)
 {
int fd;
 
-   fd = anon_inode_getfd("[userfaultfd]", _fops, new,
- O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS));
+   fd = anon_inode_getfd_secure("[userfaultfd]", _fops, new,
+   O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS), inode);
if (fd < 0)
return fd;
 
@@ -996,7 +996,7 @@ static int resolve_userfault_fork(struct userfaultfd_ctx 
*ctx,
 }
 
 static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
-   struct uffd_msg *msg)
+   struct uffd_msg *msg, struct inode *inode)
 {
ssize_t ret;
DECLARE_WAITQUEUE(wait, current);
@@ -1107,7 +1107,7 @@ static ssize_t userfaultfd_ctx_read(struct 
userfaultfd_ctx *ctx, int no_wait,
spin_unlock_irq(>fd_wqh.lock);
 
if (!ret && msg->event == UFFD_EVENT_FORK) {
-   ret = resolve_userfault_fork(ctx, fork_nctx, msg);
+   ret = resolve_userfault_fork(fork_nctx, inode, msg);
spin_lock_irq(>event_wqh.lock);
if (!list_empty(_event)) {
/*
@@ -1167,6 +1167,7 @@ static ssize_t userfaultfd_read(struct file *file, char 
__user *buf,
ssize_t _ret, ret = 0;
struct uffd_msg msg;
int no_wait = file->f_flags & O_NONBLOCK;
+   struct inode *inode = file_inode(file);
 
if (ctx->state == UFFD_STATE_WAIT_API)
return -EINVAL;
@@ -1174,7 +1175,7 @@ static ssize_t userfaultfd_read(struct file *file, char 
__user *buf,
for (;;) {
if (count < sizeof(msg))
return ret ? ret : -EINVAL;
-   _ret = userfaultfd_ctx_read(ctx, no_wait, );
+   _ret = userfaultfd_ctx_read(ctx, no_wait, , inode);
if (_ret < 0)
return ret ? ret : _ret;
if (copy_to_user((__u64 __user *) buf, , sizeof(msg)))
@@ -1999,8 +2000,8 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
/* prevent the mm struct to be freed */
mmgrab(ctx->mm);
 
-   fd = anon_inode_getfd("[userfaultfd]", _fops, ctx,
- O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+   fd = anon_inode_getfd_secure("[userfaultfd]", _fops, ctx,
+   O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS), NULL);
if (fd < 0) {
mmdrop(ctx->mm);
kmem_cache_free(userfaultfd_ctx_cachep, ctx);
-- 
2.30.0.284.gd98b1dd5eaa7-goog

[PATCH v15 0/4] SELinux support for anonymous inodes and UFFD

2021-01-08 Thread Lokesh Gidra

Userfaultfd in unprivileged contexts could be potentially very
useful. We'd like to harden userfaultfd to make such unprivileged use
less risky. This patch series allows SELinux to manage userfaultfd
file descriptors and in the future, other kinds of
anonymous-inode-based file descriptor.  SELinux policy authors can
apply policy types to anonymous inodes by providing name-based
transition rules keyed off the anonymous inode internal name (
"[userfaultfd]" in the case of userfaultfd(2) file descriptors) and
applying policy to the new SIDs thus produced.

With SELinux managed userfaultfd, an admin can control creation and
movement of the file descriptors. In particular, handling of
a userfaultfd descriptor by a different process is essentially a
ptrace access into the process, without any of the corresponding
security_ptrace_access_check() checks. For privacy, the admin may
want to deny such accesses, which is possible with SELinux support.

Inside the kernel, a new anon_inode interface, anon_inode_getfd_secure,
allows callers to opt into this SELinux management. In this new "secure"
mode, anon_inodes create new ephemeral inodes for anonymous file objects
instead of reusing the normal anon_inodes singleton dummy inode. A new
LSM hook gives security modules an opportunity to configure and veto
these ephemeral inodes.

This patch series is one of two fork of [1] and is an
alternative to [2].

The primary difference between the two patch series is that this
partch series creates a unique inode for each "secure" anonymous
inode, while the other patch series ([2]) continues using the
singleton dummy anonymous inode and adds a way to attach SELinux
security information directly to file objects.

I prefer the approach in this patch series because 1) it's a smaller
patch than [2], and 2) it produces a more regular security
architecture: in this patch series, secure anonymous inodes aren't
S_PRIVATE and they maintain the SELinux property that the label for a
file is in its inode. We do need an additional inode per anonymous
file, but per-struct-file inode creation doesn't seem to be a problem
for pipes and sockets.

The previous version of this feature ([1]) created a new SELinux
security class for userfaultfd file descriptors. This version adopts
the generic transition-based approach of [2].

This patch series also differs from [2] in that it doesn't affect all
anonymous inodes right away --- instead requiring anon_inodes callers
to opt in --- but this difference isn't one of basic approach. The
important question to resolve is whether we should be creating new
inodes or enhancing per-file data.

Changes from the first version of the patch:

  - Removed some error checks
  - Defined a new anon_inode SELinux class to resolve the
ambiguity in [3]
  - Inherit sclass as well as descriptor from context inode

Changes from the second version of the patch:

  - Fixed example policy in the commit message to reflect the use of
the new anon_inode class.

Changes from the third version of the patch:

  - Dropped the fops parameter to the LSM hook
  - Documented hook parameters
  - Fixed incorrect class used for SELinux transition
  - Removed stray UFFD changed early in the series
  - Removed a redundant ERR_PTR(PTR_ERR())

Changes from the fourth version of the patch:

  - Removed an unused parameter from an internal function
  - Fixed function documentation

Changes from the fifth version of the patch:

  - Fixed function documentation in fs/anon_inodes.c and
include/linux/lsm_hooks.h
  - Used anon_inode_getfd_secure() in userfaultfd() syscall and removed
owner from userfaultfd_ctx.

Changes from the sixth version of the patch:

  - Removed definition of anon_inode_getfile_secure() as there are no
callers.
  - Simplified function description of anon_inode_getfd_secure().
  - Elaborated more on the purpose of 'context_inode' in commit message.

Changes from the seventh version of the patch:

  - Fixed error handling in _anon_inode_getfile().
  - Fixed minor comment and indentation related issues.

Changes from the eighth version of the patch:

  - Replaced selinux_state.initialized with selinux_state.initialized

Changes from the ninth version of the patch:

  - Fixed function names in fs/anon_inodes.c
  - Fixed comment of anon_inode_getfd_secure()
  - Fixed name of the patch wherein userfaultfd code uses
anon_inode_getfd_secure()

Changes from the tenth version of the patch:

  - Split first patch into VFS and LSM specific patches
  - Fixed comments in fs/anon_inodes.c
  - Fixed comment of alloc_anon_inode()

Changes from the eleventh version of the patch:

  - Removed comment of alloc_anon_inode() for consistency with the code
  - Fixed explanation of LSM hook in the commit message

Changes from the twelfth version of the patch:
  - Replaced FILE__CREATE with ANON_INODE__CREATE while initializing
anon-inode's SELinux security struct.
  - Check context_inode's SE

[PATCH v15 3/4] selinux: teach SELinux about anonymous inodes

2021-01-08 Thread Lokesh Gidra

From: Daniel Colascione 

This change uses the anon_inodes and LSM infrastructure introduced in
the previous patches to give SELinux the ability to control
anonymous-inode files that are created using the new
anon_inode_getfd_secure() function.

A SELinux policy author detects and controls these anonymous inodes by
adding a name-based type_transition rule that assigns a new security
type to anonymous-inode files created in some domain. The name used
for the name-based transition is the name associated with the
anonymous inode for file listings --- e.g., "[userfaultfd]" or
"[perf_event]".

Example:

type uffd_t;
type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]";
allow sysadm_t uffd_t:anon_inode { create };

(The next patch in this series is necessary for making userfaultfd
support this new interface.  The example above is just
for exposition.)

Signed-off-by: Daniel Colascione 
Signed-off-by: Lokesh Gidra 
---
 security/selinux/hooks.c| 57 +
 security/selinux/include/classmap.h |  2 +
 2 files changed, 59 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 644b17ec9e63..a5e12b2fabde 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2934,6 +2934,62 @@ static int selinux_inode_init_security(struct inode 
*inode, struct inode *dir,
return 0;
 }
 
+static int selinux_inode_init_security_anon(struct inode *inode,
+   const struct qstr *name,
+   const struct inode *context_inode)
+{
+   const struct task_security_struct *tsec = selinux_cred(current_cred());
+   struct common_audit_data ad;
+   struct inode_security_struct *isec;
+   int rc;
+
+   if (unlikely(!selinux_initialized(_state)))
+   return 0;
+
+   isec = selinux_inode(inode);
+
+   /*
+* We only get here once per ephemeral inode.  The inode has
+* been initialized via inode_alloc_security but is otherwise
+* untouched.
+*/
+
+   if (context_inode) {
+   struct inode_security_struct *context_isec =
+   selinux_inode(context_inode);
+   if (context_isec->initialized != LABEL_INITIALIZED) {
+   pr_err("SELinux:  context_inode is not initialized");
+   return -EACCES;
+   }
+
+   isec->sclass = context_isec->sclass;
+   isec->sid = context_isec->sid;
+   } else {
+   isec->sclass = SECCLASS_ANON_INODE;
+   rc = security_transition_sid(
+   _state, tsec->sid, tsec->sid,
+   isec->sclass, name, >sid);
+   if (rc)
+   return rc;
+   }
+
+   isec->initialized = LABEL_INITIALIZED;
+   /*
+* Now that we've initialized security, check whether we're
+* allowed to actually create this type of anonymous inode.
+*/
+
+   ad.type = LSM_AUDIT_DATA_INODE;
+   ad.u.inode = inode;
+
+   return avc_has_perm(_state,
+   tsec->sid,
+   isec->sid,
+   isec->sclass,
+   FILE__CREATE,
+   );
+}
+
 static int selinux_inode_create(struct inode *dir, struct dentry *dentry, 
umode_t mode)
 {
return may_create(dir, dentry, SECCLASS_FILE);
@@ -7000,6 +7056,7 @@ static struct security_hook_list selinux_hooks[] 
__lsm_ro_after_init = {
 
LSM_HOOK_INIT(inode_free_security, selinux_inode_free_security),
LSM_HOOK_INIT(inode_init_security, selinux_inode_init_security),
+   LSM_HOOK_INIT(inode_init_security_anon, 
selinux_inode_init_security_anon),
LSM_HOOK_INIT(inode_create, selinux_inode_create),
LSM_HOOK_INIT(inode_link, selinux_inode_link),
LSM_HOOK_INIT(inode_unlink, selinux_inode_unlink),
diff --git a/security/selinux/include/classmap.h 
b/security/selinux/include/classmap.h
index 40cebde62856..ba2e01a6955c 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -249,6 +249,8 @@ struct security_class_mapping secclass_map[] = {
  {"open", "cpu", "kernel", "tracepoint", "read", "write"} },
{ "lockdown",
  { "integrity", "confidentiality", NULL } },
+   { "anon_inode",
+ { COMMON_FILE_PERMS, NULL } },
{ NULL }
   };
 
-- 
2.30.0.284.gd98b1dd5eaa7-goog

Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes

2021-01-08 Thread Lokesh Gidra

On Fri, Jan 8, 2021 at 1:24 PM Stephen Smalley
 wrote:
>
> On Fri, Jan 8, 2021 at 3:17 PM Lokesh Gidra  wrote:
> >
> > On Fri, Jan 8, 2021 at 11:35 AM Stephen Smalley
> >  wrote:
> > >
> > > On Wed, Jan 6, 2021 at 10:03 PM Paul Moore  wrote:
> > > >
> > > > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra  
> > > > wrote:
> > > > > From: Daniel Colascione 
> > > > >
> > > > > This change uses the anon_inodes and LSM infrastructure introduced in
> > > > > the previous patches to give SELinux the ability to control
> > > > > anonymous-inode files that are created using the new
> > > > > anon_inode_getfd_secure() function.
> > > > >
> > > > > A SELinux policy author detects and controls these anonymous inodes by
> > > > > adding a name-based type_transition rule that assigns a new security
> > > > > type to anonymous-inode files created in some domain. The name used
> > > > > for the name-based transition is the name associated with the
> > > > > anonymous inode for file listings --- e.g., "[userfaultfd]" or
> > > > > "[perf_event]".
> > > > >
> > > > > Example:
> > > > >
> > > > > type uffd_t;
> > > > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]";
> > > > > allow sysadm_t uffd_t:anon_inode { create };
> > > > >
> > > > > (The next patch in this series is necessary for making userfaultfd
> > > > > support this new interface.  The example above is just
> > > > > for exposition.)
> > > > >
> > > > > Signed-off-by: Daniel Colascione 
> > > > > Signed-off-by: Lokesh Gidra 
> > > > > ---
> > > > >  security/selinux/hooks.c| 56 
> > > > > +
> > > > >  security/selinux/include/classmap.h |  2 ++
> > > > >  2 files changed, 58 insertions(+)
> > > > >
> > > > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > > > > index 6b1826fc3658..d092aa512868 100644
> > > > > --- a/security/selinux/hooks.c
> > > > > +++ b/security/selinux/hooks.c
> > > > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct 
> > > > > inode *inode, struct inode *dir,
> > > > > return 0;
> > > > >  }
> > > > >
> > > > > +static int selinux_inode_init_security_anon(struct inode *inode,
> > > > > +   const struct qstr *name,
> > > > > +   const struct inode 
> > > > > *context_inode)
> > > > > +{
> > > > > +   const struct task_security_struct *tsec = 
> > > > > selinux_cred(current_cred());
> > > > > +   struct common_audit_data ad;
> > > > > +   struct inode_security_struct *isec;
> > > > > +   int rc;
> > > > > +
> > > > > +   if (unlikely(!selinux_initialized(_state)))
> > > > > +   return 0;
> > > > > +
> > > > > +   isec = selinux_inode(inode);
> > > > > +
> > > > > +   /*
> > > > > +* We only get here once per ephemeral inode.  The inode has
> > > > > +* been initialized via inode_alloc_security but is otherwise
> > > > > +* untouched.
> > > > > +*/
> > > > > +
> > > > > +   if (context_inode) {
> > > > > +   struct inode_security_struct *context_isec =
> > > > > +   selinux_inode(context_inode);
> > > > > +   if (context_isec->initialized != LABEL_INITIALIZED)
> > > > > +   return -EACCES;
> > Stephen, as per your explanation below, is this check also
> > problematic? I mean is it possible that /dev/kvm context_inode may not
> > have its label initialized? If so, then v12 of the patch series can be
> > used as is. Otherwise, I will send the next version which rollbacks
> > v14 and v13, except for this check. Kindly confirm.
>
> The context_inode should always be initialized already.  I'm not fond
> though of silently returning -EACCES here.  At the least we should
> have a pr_err() or pr_warn() here.  In reality, this could only occur
> in the case of a kernel bug or memory corruption so it used to be a
> candidate for WARN_ON() or BUG_ON() or similar but I know that
> BUG_ON() at least is frowned upon these days.

Got it. I'll add a pr_err(). Thanks a lot.

Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes

2021-01-08 Thread Stephen Smalley

On Fri, Jan 8, 2021 at 3:17 PM Lokesh Gidra  wrote:
>
> On Fri, Jan 8, 2021 at 11:35 AM Stephen Smalley
>  wrote:
> >
> > On Wed, Jan 6, 2021 at 10:03 PM Paul Moore  wrote:
> > >
> > > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra  
> > > wrote:
> > > > From: Daniel Colascione 
> > > >
> > > > This change uses the anon_inodes and LSM infrastructure introduced in
> > > > the previous patches to give SELinux the ability to control
> > > > anonymous-inode files that are created using the new
> > > > anon_inode_getfd_secure() function.
> > > >
> > > > A SELinux policy author detects and controls these anonymous inodes by
> > > > adding a name-based type_transition rule that assigns a new security
> > > > type to anonymous-inode files created in some domain. The name used
> > > > for the name-based transition is the name associated with the
> > > > anonymous inode for file listings --- e.g., "[userfaultfd]" or
> > > > "[perf_event]".
> > > >
> > > > Example:
> > > >
> > > > type uffd_t;
> > > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]";
> > > > allow sysadm_t uffd_t:anon_inode { create };
> > > >
> > > > (The next patch in this series is necessary for making userfaultfd
> > > > support this new interface.  The example above is just
> > > > for exposition.)
> > > >
> > > > Signed-off-by: Daniel Colascione 
> > > > Signed-off-by: Lokesh Gidra 
> > > > ---
> > > >  security/selinux/hooks.c| 56 +
> > > >  security/selinux/include/classmap.h |  2 ++
> > > >  2 files changed, 58 insertions(+)
> > > >
> > > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > > > index 6b1826fc3658..d092aa512868 100644
> > > > --- a/security/selinux/hooks.c
> > > > +++ b/security/selinux/hooks.c
> > > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct 
> > > > inode *inode, struct inode *dir,
> > > > return 0;
> > > >  }
> > > >
> > > > +static int selinux_inode_init_security_anon(struct inode *inode,
> > > > +   const struct qstr *name,
> > > > +   const struct inode 
> > > > *context_inode)
> > > > +{
> > > > +   const struct task_security_struct *tsec = 
> > > > selinux_cred(current_cred());
> > > > +   struct common_audit_data ad;
> > > > +   struct inode_security_struct *isec;
> > > > +   int rc;
> > > > +
> > > > +   if (unlikely(!selinux_initialized(_state)))
> > > > +   return 0;
> > > > +
> > > > +   isec = selinux_inode(inode);
> > > > +
> > > > +   /*
> > > > +* We only get here once per ephemeral inode.  The inode has
> > > > +* been initialized via inode_alloc_security but is otherwise
> > > > +* untouched.
> > > > +*/
> > > > +
> > > > +   if (context_inode) {
> > > > +   struct inode_security_struct *context_isec =
> > > > +   selinux_inode(context_inode);
> > > > +   if (context_isec->initialized != LABEL_INITIALIZED)
> > > > +   return -EACCES;
> Stephen, as per your explanation below, is this check also
> problematic? I mean is it possible that /dev/kvm context_inode may not
> have its label initialized? If so, then v12 of the patch series can be
> used as is. Otherwise, I will send the next version which rollbacks
> v14 and v13, except for this check. Kindly confirm.

The context_inode should always be initialized already.  I'm not fond
though of silently returning -EACCES here.  At the least we should
have a pr_err() or pr_warn() here.  In reality, this could only occur
in the case of a kernel bug or memory corruption so it used to be a
candidate for WARN_ON() or BUG_ON() or similar but I know that
BUG_ON() at least is frowned upon these days.

Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes

2021-01-08 Thread Paul Moore

On Fri, Jan 8, 2021 at 2:35 PM Stephen Smalley
 wrote:
> On Wed, Jan 6, 2021 at 10:03 PM Paul Moore  wrote:
> > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra  wrote:
> > > From: Daniel Colascione 
> > >
> > > This change uses the anon_inodes and LSM infrastructure introduced in
> > > the previous patches to give SELinux the ability to control
> > > anonymous-inode files that are created using the new
> > > anon_inode_getfd_secure() function.
> > >
> > > A SELinux policy author detects and controls these anonymous inodes by
> > > adding a name-based type_transition rule that assigns a new security
> > > type to anonymous-inode files created in some domain. The name used
> > > for the name-based transition is the name associated with the
> > > anonymous inode for file listings --- e.g., "[userfaultfd]" or
> > > "[perf_event]".
> > >
> > > Example:
> > >
> > > type uffd_t;
> > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]";
> > > allow sysadm_t uffd_t:anon_inode { create };
> > >
> > > (The next patch in this series is necessary for making userfaultfd
> > > support this new interface.  The example above is just
> > > for exposition.)
> > >
> > > Signed-off-by: Daniel Colascione 
> > > Signed-off-by: Lokesh Gidra 
> > > ---
> > >  security/selinux/hooks.c| 56 +
> > >  security/selinux/include/classmap.h |  2 ++
> > >  2 files changed, 58 insertions(+)
> > >
> > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > > index 6b1826fc3658..d092aa512868 100644
> > > --- a/security/selinux/hooks.c
> > > +++ b/security/selinux/hooks.c
> > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct 
> > > inode *inode, struct inode *dir,
> > > return 0;
> > >  }
> > >
> > > +static int selinux_inode_init_security_anon(struct inode *inode,
> > > +   const struct qstr *name,
> > > +   const struct inode 
> > > *context_inode)
> > > +{
> > > +   const struct task_security_struct *tsec = 
> > > selinux_cred(current_cred());
> > > +   struct common_audit_data ad;
> > > +   struct inode_security_struct *isec;
> > > +   int rc;
> > > +
> > > +   if (unlikely(!selinux_initialized(_state)))
> > > +   return 0;
> > > +
> > > +   isec = selinux_inode(inode);
> > > +
> > > +   /*
> > > +* We only get here once per ephemeral inode.  The inode has
> > > +* been initialized via inode_alloc_security but is otherwise
> > > +* untouched.
> > > +*/
> > > +
> > > +   if (context_inode) {
> > > +   struct inode_security_struct *context_isec =
> > > +   selinux_inode(context_inode);
> > > +   if (context_isec->initialized != LABEL_INITIALIZED)
> > > +   return -EACCES;
> > > +
> > > +   isec->sclass = context_isec->sclass;
> >
> > Taking the object class directly from the context_inode is
> > interesting, and I suspect problematic.  In the case below where no
> > context_inode is supplied the object class is set to
> > SECCLASS_ANON_INODE, which is correct, but when a context_inode is
> > supplied there is no guarantee that the object class will be set to
> > SECCLASS_ANON_INODE.  This could both pose a problem for policy
> > writers (how do you distinguish the anon inode from other normal file
> > inodes in this case?) as well as an outright fault later in this
> > function when we try to check the ANON_INODE__CREATE on an object
> > other than a SECCLASS_ANON_INODE object.
> >
> > It works in the userfaultfd case because the context_inode is
> > originally created with this function so the object class is correctly
> > set to SECCLASS_ANON_INODE, but can we always guarantee that to be the
> > case?  Do we ever need or want to support using a context_inode that
> > is not SECCLASS_ANON_INODE?
>
> Sorry, I haven't been following this.  IIRC, the original reason for
> passing a context_inode was to support the /dev/kvm or similar use
> cases where the driver is creating anonymous inodes to represent
> specific objects/interfaces derived from

Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes

2021-01-08 Thread Lokesh Gidra

On Fri, Jan 8, 2021 at 11:35 AM Stephen Smalley
 wrote:
>
> On Wed, Jan 6, 2021 at 10:03 PM Paul Moore  wrote:
> >
> > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra  wrote:
> > > From: Daniel Colascione 
> > >
> > > This change uses the anon_inodes and LSM infrastructure introduced in
> > > the previous patches to give SELinux the ability to control
> > > anonymous-inode files that are created using the new
> > > anon_inode_getfd_secure() function.
> > >
> > > A SELinux policy author detects and controls these anonymous inodes by
> > > adding a name-based type_transition rule that assigns a new security
> > > type to anonymous-inode files created in some domain. The name used
> > > for the name-based transition is the name associated with the
> > > anonymous inode for file listings --- e.g., "[userfaultfd]" or
> > > "[perf_event]".
> > >
> > > Example:
> > >
> > > type uffd_t;
> > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]";
> > > allow sysadm_t uffd_t:anon_inode { create };
> > >
> > > (The next patch in this series is necessary for making userfaultfd
> > > support this new interface.  The example above is just
> > > for exposition.)
> > >
> > > Signed-off-by: Daniel Colascione 
> > > Signed-off-by: Lokesh Gidra 
> > > ---
> > >  security/selinux/hooks.c| 56 +
> > >  security/selinux/include/classmap.h |  2 ++
> > >  2 files changed, 58 insertions(+)
> > >
> > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > > index 6b1826fc3658..d092aa512868 100644
> > > --- a/security/selinux/hooks.c
> > > +++ b/security/selinux/hooks.c
> > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct 
> > > inode *inode, struct inode *dir,
> > > return 0;
> > >  }
> > >
> > > +static int selinux_inode_init_security_anon(struct inode *inode,
> > > +   const struct qstr *name,
> > > +   const struct inode 
> > > *context_inode)
> > > +{
> > > +   const struct task_security_struct *tsec = 
> > > selinux_cred(current_cred());
> > > +   struct common_audit_data ad;
> > > +   struct inode_security_struct *isec;
> > > +   int rc;
> > > +
> > > +   if (unlikely(!selinux_initialized(_state)))
> > > +   return 0;
> > > +
> > > +   isec = selinux_inode(inode);
> > > +
> > > +   /*
> > > +* We only get here once per ephemeral inode.  The inode has
> > > +* been initialized via inode_alloc_security but is otherwise
> > > +* untouched.
> > > +*/
> > > +
> > > +   if (context_inode) {
> > > +   struct inode_security_struct *context_isec =
> > > +   selinux_inode(context_inode);
> > > +   if (context_isec->initialized != LABEL_INITIALIZED)
> > > +   return -EACCES;
Stephen, as per your explanation below, is this check also
problematic? I mean is it possible that /dev/kvm context_inode may not
have its label initialized? If so, then v12 of the patch series can be
used as is. Otherwise, I will send the next version which rollbacks
v14 and v13, except for this check. Kindly confirm.

> > > +
> > > +   isec->sclass = context_isec->sclass;
> >
> > Taking the object class directly from the context_inode is
> > interesting, and I suspect problematic.  In the case below where no
> > context_inode is supplied the object class is set to
> > SECCLASS_ANON_INODE, which is correct, but when a context_inode is
> > supplied there is no guarantee that the object class will be set to
> > SECCLASS_ANON_INODE.  This could both pose a problem for policy
> > writers (how do you distinguish the anon inode from other normal file
> > inodes in this case?) as well as an outright fault later in this
> > function when we try to check the ANON_INODE__CREATE on an object
> > other than a SECCLASS_ANON_INODE object.
> >
> > It works in the userfaultfd case because the context_inode is
> > originally created with this function so the object class is correctly
> > set to SECCLASS_ANON_INODE, but can we always guarantee that to be the
> > case?  Do we ever need or want to

Re: [PATCH v14 3/4] selinux: teach SELinux about anonymous inodes

2021-01-08 Thread Stephen Smalley

On Fri, Jan 8, 2021 at 12:33 AM Lokesh Gidra  wrote:
>
> From: Daniel Colascione 
>
> This change uses the anon_inodes and LSM infrastructure introduced in
> the previous patches to give SELinux the ability to control
> anonymous-inode files that are created using the new
> anon_inode_getfd_secure() function.
>
> A SELinux policy author detects and controls these anonymous inodes by
> adding a name-based type_transition rule that assigns a new security
> type to anonymous-inode files created in some domain. The name used
> for the name-based transition is the name associated with the
> anonymous inode for file listings --- e.g., "[userfaultfd]" or
> "[perf_event]".
>
> Example:
>
> type uffd_t;
> type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]";
> allow sysadm_t uffd_t:anon_inode { create };
>
> (The next patch in this series is necessary for making userfaultfd
> support this new interface.  The example above is just
> for exposition.)
>
> Signed-off-by: Daniel Colascione 
> Signed-off-by: Lokesh Gidra 
> ---
>  security/selinux/hooks.c| 59 +
>  security/selinux/include/classmap.h |  2 +
>  2 files changed, 61 insertions(+)
>
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 644b17ec9e63..8b4e155b2930 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -2934,6 +2933,63 @@ static int selinux_inode_init_security(struct inode 
> *inode, struct inode *dir,
> return 0;
>  }
>
> +static int selinux_inode_init_security_anon(struct inode *inode,
> +   const struct qstr *name,
> +   const struct inode *context_inode)
> +{
> +   const struct task_security_struct *tsec = 
> selinux_cred(current_cred());
> +   struct common_audit_data ad;
> +   struct inode_security_struct *isec;
> +   int rc;
> +
> +   if (unlikely(!selinux_initialized(_state)))
> +   return 0;
> +
> +   isec = selinux_inode(inode);
> +
> +   /*
> +* We only get here once per ephemeral inode.  The inode has
> +* been initialized via inode_alloc_security but is otherwise
> +* untouched.
> +*/
> +   isec->initialized = LABEL_INITIALIZED;
> +   isec->sclass = SECCLASS_ANON_INODE;
> +
> +   if (context_inode) {
> +   struct inode_security_struct *context_isec =
> +   selinux_inode(context_inode);
> +   if (context_isec->initialized != LABEL_INITIALIZED)
> +   return -EACCES;
> +   if (context_isec->sclass != SECCLASS_ANON_INODE) {
> +   pr_err("SELinux:  initializing anonymous inode with 
> non-anonymous inode");
> +   return -EACCES;
> +   }

This would preclude using this facility for anonymous inodes created
by kvm and other use cases.
Don't do this.

> +
> +   isec->sid = context_isec->sid;
> +   } else {
> +   rc = security_transition_sid(
> +   _state, tsec->sid, tsec->sid,
> +   isec->sclass, name, >sid);
> +   if (rc)
> +   return rc;
> +   }
> +
> +   /*
> +* Now that we've initialized security, check whether we're
> +* allowed to actually create this type of anonymous inode.
> +*/
> +
> +   ad.type = LSM_AUDIT_DATA_INODE;
> +   ad.u.inode = inode;
> +
> +   return avc_has_perm(_state,
> +   tsec->sid,
> +   isec->sid,
> +   isec->sclass,
> +   ANON_INODE__CREATE,

FILE__CREATE is perfectly appropriate here, not that it makes any difference.

Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes

2021-01-08 Thread Stephen Smalley

On Wed, Jan 6, 2021 at 10:03 PM Paul Moore  wrote:
>
> On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra  wrote:
> > From: Daniel Colascione 
> >
> > This change uses the anon_inodes and LSM infrastructure introduced in
> > the previous patches to give SELinux the ability to control
> > anonymous-inode files that are created using the new
> > anon_inode_getfd_secure() function.
> >
> > A SELinux policy author detects and controls these anonymous inodes by
> > adding a name-based type_transition rule that assigns a new security
> > type to anonymous-inode files created in some domain. The name used
> > for the name-based transition is the name associated with the
> > anonymous inode for file listings --- e.g., "[userfaultfd]" or
> > "[perf_event]".
> >
> > Example:
> >
> > type uffd_t;
> > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]";
> > allow sysadm_t uffd_t:anon_inode { create };
> >
> > (The next patch in this series is necessary for making userfaultfd
> > support this new interface.  The example above is just
> > for exposition.)
> >
> > Signed-off-by: Daniel Colascione 
> > Signed-off-by: Lokesh Gidra 
> > ---
> >  security/selinux/hooks.c| 56 +
> >  security/selinux/include/classmap.h |  2 ++
> >  2 files changed, 58 insertions(+)
> >
> > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > index 6b1826fc3658..d092aa512868 100644
> > --- a/security/selinux/hooks.c
> > +++ b/security/selinux/hooks.c
> > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct inode 
> > *inode, struct inode *dir,
> > return 0;
> >  }
> >
> > +static int selinux_inode_init_security_anon(struct inode *inode,
> > +   const struct qstr *name,
> > +   const struct inode 
> > *context_inode)
> > +{
> > +   const struct task_security_struct *tsec = 
> > selinux_cred(current_cred());
> > +   struct common_audit_data ad;
> > +   struct inode_security_struct *isec;
> > +   int rc;
> > +
> > +   if (unlikely(!selinux_initialized(_state)))
> > +   return 0;
> > +
> > +   isec = selinux_inode(inode);
> > +
> > +   /*
> > +* We only get here once per ephemeral inode.  The inode has
> > +* been initialized via inode_alloc_security but is otherwise
> > +* untouched.
> > +*/
> > +
> > +   if (context_inode) {
> > +   struct inode_security_struct *context_isec =
> > +   selinux_inode(context_inode);
> > +   if (context_isec->initialized != LABEL_INITIALIZED)
> > +   return -EACCES;
> > +
> > +   isec->sclass = context_isec->sclass;
>
> Taking the object class directly from the context_inode is
> interesting, and I suspect problematic.  In the case below where no
> context_inode is supplied the object class is set to
> SECCLASS_ANON_INODE, which is correct, but when a context_inode is
> supplied there is no guarantee that the object class will be set to
> SECCLASS_ANON_INODE.  This could both pose a problem for policy
> writers (how do you distinguish the anon inode from other normal file
> inodes in this case?) as well as an outright fault later in this
> function when we try to check the ANON_INODE__CREATE on an object
> other than a SECCLASS_ANON_INODE object.
>
> It works in the userfaultfd case because the context_inode is
> originally created with this function so the object class is correctly
> set to SECCLASS_ANON_INODE, but can we always guarantee that to be the
> case?  Do we ever need or want to support using a context_inode that
> is not SECCLASS_ANON_INODE?

Sorry, I haven't been following this.  IIRC, the original reason for
passing a context_inode was to support the /dev/kvm or similar use
cases where the driver is creating anonymous inodes to represent
specific objects/interfaces derived from the device node and we want
to be able to control subsequent ioctl operations on those anonymous
inodes in the same manner as for the device node.  For example, ioctl
operations on /dev/kvm can end up returning file descriptors for
anonymous inodes representing a specific VM or VCPU or similar.  If we
propagate the security class and SID from the /dev/kvm inode (the
context inode) to the new anonymous inode, we can write a single
policy rule over all ioctl operations related to /dev/kvm.  That's
also why we used the FILE__CREATE permission here originally; that was
also intentional.  All the file-related classes including anon_inode
inherit a common set of file permissions including create and thus we
often use the FILE__ in common code when checking
permission against any potentially derived class.

[PATCH v14 4/4] userfaultfd: use secure anon inodes for userfaultfd

2021-01-07 Thread Lokesh Gidra

From: Daniel Colascione 

This change gives userfaultfd file descriptors a real security
context, allowing policy to act on them.

Signed-off-by: Daniel Colascione 

[LG: Remove owner inode from userfaultfd_ctx]
[LG: Use anon_inode_getfd_secure() instead of anon_inode_getfile_secure()
 in userfaultfd syscall]
[LG: Use inode of file in userfaultfd_read() in resolve_userfault_fork()]

Signed-off-by: Lokesh Gidra 
Reviewed-by: Eric Biggers 
---
 fs/userfaultfd.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 894cc28142e7..0be8cdd4425a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -979,14 +979,14 @@ static __poll_t userfaultfd_poll(struct file *file, 
poll_table *wait)
 
 static const struct file_operations userfaultfd_fops;
 
-static int resolve_userfault_fork(struct userfaultfd_ctx *ctx,
- struct userfaultfd_ctx *new,
+static int resolve_userfault_fork(struct userfaultfd_ctx *new,
+ struct inode *inode,
  struct uffd_msg *msg)
 {
int fd;
 
-   fd = anon_inode_getfd("[userfaultfd]", _fops, new,
- O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS));
+   fd = anon_inode_getfd_secure("[userfaultfd]", _fops, new,
+   O_RDWR | (new->flags & UFFD_SHARED_FCNTL_FLAGS), inode);
if (fd < 0)
return fd;
 
@@ -996,7 +996,7 @@ static int resolve_userfault_fork(struct userfaultfd_ctx 
*ctx,
 }
 
 static ssize_t userfaultfd_ctx_read(struct userfaultfd_ctx *ctx, int no_wait,
-   struct uffd_msg *msg)
+   struct uffd_msg *msg, struct inode *inode)
 {
ssize_t ret;
DECLARE_WAITQUEUE(wait, current);
@@ -1107,7 +1107,7 @@ static ssize_t userfaultfd_ctx_read(struct 
userfaultfd_ctx *ctx, int no_wait,
spin_unlock_irq(>fd_wqh.lock);
 
if (!ret && msg->event == UFFD_EVENT_FORK) {
-   ret = resolve_userfault_fork(ctx, fork_nctx, msg);
+   ret = resolve_userfault_fork(fork_nctx, inode, msg);
spin_lock_irq(>event_wqh.lock);
if (!list_empty(_event)) {
/*
@@ -1167,6 +1167,7 @@ static ssize_t userfaultfd_read(struct file *file, char 
__user *buf,
ssize_t _ret, ret = 0;
struct uffd_msg msg;
int no_wait = file->f_flags & O_NONBLOCK;
+   struct inode *inode = file_inode(file);
 
if (ctx->state == UFFD_STATE_WAIT_API)
return -EINVAL;
@@ -1174,7 +1175,7 @@ static ssize_t userfaultfd_read(struct file *file, char 
__user *buf,
for (;;) {
if (count < sizeof(msg))
return ret ? ret : -EINVAL;
-   _ret = userfaultfd_ctx_read(ctx, no_wait, );
+   _ret = userfaultfd_ctx_read(ctx, no_wait, , inode);
if (_ret < 0)
return ret ? ret : _ret;
if (copy_to_user((__u64 __user *) buf, , sizeof(msg)))
@@ -1999,8 +2000,8 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
/* prevent the mm struct to be freed */
mmgrab(ctx->mm);
 
-   fd = anon_inode_getfd("[userfaultfd]", _fops, ctx,
- O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS));
+   fd = anon_inode_getfd_secure("[userfaultfd]", _fops, ctx,
+   O_RDWR | (flags & UFFD_SHARED_FCNTL_FLAGS), NULL);
if (fd < 0) {
mmdrop(ctx->mm);
kmem_cache_free(userfaultfd_ctx_cachep, ctx);
-- 
2.30.0.284.gd98b1dd5eaa7-goog

[PATCH v14 3/4] selinux: teach SELinux about anonymous inodes

2021-01-07 Thread Lokesh Gidra

From: Daniel Colascione 

This change uses the anon_inodes and LSM infrastructure introduced in
the previous patches to give SELinux the ability to control
anonymous-inode files that are created using the new
anon_inode_getfd_secure() function.

A SELinux policy author detects and controls these anonymous inodes by
adding a name-based type_transition rule that assigns a new security
type to anonymous-inode files created in some domain. The name used
for the name-based transition is the name associated with the
anonymous inode for file listings --- e.g., "[userfaultfd]" or
"[perf_event]".

Example:

type uffd_t;
type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]";
allow sysadm_t uffd_t:anon_inode { create };

(The next patch in this series is necessary for making userfaultfd
support this new interface.  The example above is just
for exposition.)

Signed-off-by: Daniel Colascione 
Signed-off-by: Lokesh Gidra 
---
 security/selinux/hooks.c| 59 +
 security/selinux/include/classmap.h |  2 +
 2 files changed, 61 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 644b17ec9e63..8b4e155b2930 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2934,6 +2933,63 @@ static int selinux_inode_init_security(struct inode 
*inode, struct inode *dir,
return 0;
 }
 
+static int selinux_inode_init_security_anon(struct inode *inode,
+   const struct qstr *name,
+   const struct inode *context_inode)
+{
+   const struct task_security_struct *tsec = selinux_cred(current_cred());
+   struct common_audit_data ad;
+   struct inode_security_struct *isec;
+   int rc;
+
+   if (unlikely(!selinux_initialized(_state)))
+   return 0;
+
+   isec = selinux_inode(inode);
+
+   /*
+* We only get here once per ephemeral inode.  The inode has
+* been initialized via inode_alloc_security but is otherwise
+* untouched.
+*/
+   isec->initialized = LABEL_INITIALIZED;
+   isec->sclass = SECCLASS_ANON_INODE;
+
+   if (context_inode) {
+   struct inode_security_struct *context_isec =
+   selinux_inode(context_inode);
+   if (context_isec->initialized != LABEL_INITIALIZED)
+   return -EACCES;
+   if (context_isec->sclass != SECCLASS_ANON_INODE) {
+   pr_err("SELinux:  initializing anonymous inode with 
non-anonymous inode");
+   return -EACCES;
+   }
+
+   isec->sid = context_isec->sid;
+   } else {
+   rc = security_transition_sid(
+   _state, tsec->sid, tsec->sid,
+   isec->sclass, name, >sid);
+   if (rc)
+   return rc;
+   }
+
+   /*
+* Now that we've initialized security, check whether we're
+* allowed to actually create this type of anonymous inode.
+*/
+
+   ad.type = LSM_AUDIT_DATA_INODE;
+   ad.u.inode = inode;
+
+   return avc_has_perm(_state,
+   tsec->sid,
+   isec->sid,
+   isec->sclass,
+   ANON_INODE__CREATE,
+   );
+}
+
 static int selinux_inode_create(struct inode *dir, struct dentry *dentry, 
umode_t mode)
 {
return may_create(dir, dentry, SECCLASS_FILE);
@@ -7000,6 +7057,7 @@ static struct security_hook_list selinux_hooks[] 
__lsm_ro_after_init = {
 
LSM_HOOK_INIT(inode_free_security, selinux_inode_free_security),
LSM_HOOK_INIT(inode_init_security, selinux_inode_init_security),
+   LSM_HOOK_INIT(inode_init_security_anon, 
selinux_inode_init_security_anon),
LSM_HOOK_INIT(inode_create, selinux_inode_create),
LSM_HOOK_INIT(inode_link, selinux_inode_link),
LSM_HOOK_INIT(inode_unlink, selinux_inode_unlink),
diff --git a/security/selinux/include/classmap.h 
b/security/selinux/include/classmap.h
index 40cebde62856..ba2e01a6955c 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -249,6 +249,8 @@ struct security_class_mapping secclass_map[] = {
  {"open", "cpu", "kernel", "tracepoint", "read", "write"} },
{ "lockdown",
  { "integrity", "confidentiality", NULL } },
+   { "anon_inode",
+ { COMMON_FILE_PERMS, NULL } },
{ NULL }
   };
 
-- 
2.30.0.284.gd98b1dd5eaa7-goog

[PATCH v14 0/4] SELinux support for anonymous inodes and UFFD

2021-01-07 Thread Lokesh Gidra

Userfaultfd in unprivileged contexts could be potentially very
useful. We'd like to harden userfaultfd to make such unprivileged use
less risky. This patch series allows SELinux to manage userfaultfd
file descriptors and in the future, other kinds of
anonymous-inode-based file descriptor.  SELinux policy authors can
apply policy types to anonymous inodes by providing name-based
transition rules keyed off the anonymous inode internal name (
"[userfaultfd]" in the case of userfaultfd(2) file descriptors) and
applying policy to the new SIDs thus produced.

With SELinux managed userfaultfd, an admin can control creation and
movement of the file descriptors. In particular, handling of
a userfaultfd descriptor by a different process is essentially a
ptrace access into the process, without any of the corresponding
security_ptrace_access_check() checks. For privacy, the admin may
want to deny such accesses, which is possible with SELinux support.

Inside the kernel, a new anon_inode interface, anon_inode_getfd_secure,
allows callers to opt into this SELinux management. In this new "secure"
mode, anon_inodes create new ephemeral inodes for anonymous file objects
instead of reusing the normal anon_inodes singleton dummy inode. A new
LSM hook gives security modules an opportunity to configure and veto
these ephemeral inodes.

This patch series is one of two fork of [1] and is an
alternative to [2].

The primary difference between the two patch series is that this
partch series creates a unique inode for each "secure" anonymous
inode, while the other patch series ([2]) continues using the
singleton dummy anonymous inode and adds a way to attach SELinux
security information directly to file objects.

I prefer the approach in this patch series because 1) it's a smaller
patch than [2], and 2) it produces a more regular security
architecture: in this patch series, secure anonymous inodes aren't
S_PRIVATE and they maintain the SELinux property that the label for a
file is in its inode. We do need an additional inode per anonymous
file, but per-struct-file inode creation doesn't seem to be a problem
for pipes and sockets.

The previous version of this feature ([1]) created a new SELinux
security class for userfaultfd file descriptors. This version adopts
the generic transition-based approach of [2].

This patch series also differs from [2] in that it doesn't affect all
anonymous inodes right away --- instead requiring anon_inodes callers
to opt in --- but this difference isn't one of basic approach. The
important question to resolve is whether we should be creating new
inodes or enhancing per-file data.

Changes from the first version of the patch:

  - Removed some error checks
  - Defined a new anon_inode SELinux class to resolve the
ambiguity in [3]
  - Inherit sclass as well as descriptor from context inode

Changes from the second version of the patch:

  - Fixed example policy in the commit message to reflect the use of
the new anon_inode class.

Changes from the third version of the patch:

  - Dropped the fops parameter to the LSM hook
  - Documented hook parameters
  - Fixed incorrect class used for SELinux transition
  - Removed stray UFFD changed early in the series
  - Removed a redundant ERR_PTR(PTR_ERR())

Changes from the fourth version of the patch:

  - Removed an unused parameter from an internal function
  - Fixed function documentation

Changes from the fifth version of the patch:

  - Fixed function documentation in fs/anon_inodes.c and
include/linux/lsm_hooks.h
  - Used anon_inode_getfd_secure() in userfaultfd() syscall and removed
owner from userfaultfd_ctx.

Changes from the sixth version of the patch:

  - Removed definition of anon_inode_getfile_secure() as there are no
callers.
  - Simplified function description of anon_inode_getfd_secure().
  - Elaborated more on the purpose of 'context_inode' in commit message.

Changes from the seventh version of the patch:

  - Fixed error handling in _anon_inode_getfile().
  - Fixed minor comment and indentation related issues.

Changes from the eighth version of the patch:

  - Replaced selinux_state.initialized with selinux_state.initialized

Changes from the ninth version of the patch:

  - Fixed function names in fs/anon_inodes.c
  - Fixed comment of anon_inode_getfd_secure()
  - Fixed name of the patch wherein userfaultfd code uses
anon_inode_getfd_secure()

Changes from the tenth version of the patch:

  - Split first patch into VFS and LSM specific patches
  - Fixed comments in fs/anon_inodes.c
  - Fixed comment of alloc_anon_inode()

Changes from the eleventh version of the patch:

  - Removed comment of alloc_anon_inode() for consistency with the code
  - Fixed explanation of LSM hook in the commit message

Changes from the twelfth version of the patch:
  - Replaced FILE__CREATE with ANON_INODE__CREATE while initializing
anon-inode's SELinux security struct.
  - Check context_inode's SE

Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes

2021-01-07 Thread Lokesh Gidra

On Thu, Jan 7, 2021 at 2:30 PM Paul Moore  wrote:
>
> On Wed, Jan 6, 2021 at 10:55 PM Lokesh Gidra  wrote:
> > On Wed, Jan 6, 2021 at 7:03 PM Paul Moore  wrote:
> > > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra  
> > > wrote:
> > > > From: Daniel Colascione 
> > > >
> > > > This change uses the anon_inodes and LSM infrastructure introduced in
> > > > the previous patches to give SELinux the ability to control
> > > > anonymous-inode files that are created using the new
> > > > anon_inode_getfd_secure() function.
> > > >
> > > > A SELinux policy author detects and controls these anonymous inodes by
> > > > adding a name-based type_transition rule that assigns a new security
> > > > type to anonymous-inode files created in some domain. The name used
> > > > for the name-based transition is the name associated with the
> > > > anonymous inode for file listings --- e.g., "[userfaultfd]" or
> > > > "[perf_event]".
> > > >
> > > > Example:
> > > >
> > > > type uffd_t;
> > > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]";
> > > > allow sysadm_t uffd_t:anon_inode { create };
> > > >
> > > > (The next patch in this series is necessary for making userfaultfd
> > > > support this new interface.  The example above is just
> > > > for exposition.)
> > > >
> > > > Signed-off-by: Daniel Colascione 
> > > > Signed-off-by: Lokesh Gidra 
> > > > ---
> > > >  security/selinux/hooks.c| 56 +
> > > >  security/selinux/include/classmap.h |  2 ++
> > > >  2 files changed, 58 insertions(+)
> > > >
> > > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > > > index 6b1826fc3658..d092aa512868 100644
> > > > --- a/security/selinux/hooks.c
> > > > +++ b/security/selinux/hooks.c
> > > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct 
> > > > inode *inode, struct inode *dir,
> > > > return 0;
> > > >  }
> > > >
> > > > +static int selinux_inode_init_security_anon(struct inode *inode,
> > > > +   const struct qstr *name,
> > > > +   const struct inode 
> > > > *context_inode)
> > > > +{
> > > > +   const struct task_security_struct *tsec = 
> > > > selinux_cred(current_cred());
> > > > +   struct common_audit_data ad;
> > > > +   struct inode_security_struct *isec;
> > > > +   int rc;
> > > > +
> > > > +   if (unlikely(!selinux_initialized(_state)))
> > > > +   return 0;
> > > > +
> > > > +   isec = selinux_inode(inode);
> > > > +
> > > > +   /*
> > > > +* We only get here once per ephemeral inode.  The inode has
> > > > +* been initialized via inode_alloc_security but is otherwise
> > > > +* untouched.
> > > > +*/
> > > > +
> > > > +   if (context_inode) {
> > > > +   struct inode_security_struct *context_isec =
> > > > +   selinux_inode(context_inode);
> > > > +   if (context_isec->initialized != LABEL_INITIALIZED)
> > > > +   return -EACCES;
> > > > +
> > > > +   isec->sclass = context_isec->sclass;
> > >
> > > Taking the object class directly from the context_inode is
> > > interesting, and I suspect problematic.  In the case below where no
> > > context_inode is supplied the object class is set to
> > > SECCLASS_ANON_INODE, which is correct, but when a context_inode is
> > > supplied there is no guarantee that the object class will be set to
> > > SECCLASS_ANON_INODE.  This could both pose a problem for policy
> > > writers (how do you distinguish the anon inode from other normal file
> > > inodes in this case?) as well as an outright fault later in this
> > > function when we try to check the ANON_INODE__CREATE on an object
> > > other than a SECCLASS_ANON_INODE object.
> > >
> > Thanks for catching this. I'll initialize 'sclass' unconditionally to
> > SECCLASS

Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes

2021-01-07 Thread Paul Moore

On Wed, Jan 6, 2021 at 10:55 PM Lokesh Gidra  wrote:
> On Wed, Jan 6, 2021 at 7:03 PM Paul Moore  wrote:
> > On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra  wrote:
> > > From: Daniel Colascione 
> > >
> > > This change uses the anon_inodes and LSM infrastructure introduced in
> > > the previous patches to give SELinux the ability to control
> > > anonymous-inode files that are created using the new
> > > anon_inode_getfd_secure() function.
> > >
> > > A SELinux policy author detects and controls these anonymous inodes by
> > > adding a name-based type_transition rule that assigns a new security
> > > type to anonymous-inode files created in some domain. The name used
> > > for the name-based transition is the name associated with the
> > > anonymous inode for file listings --- e.g., "[userfaultfd]" or
> > > "[perf_event]".
> > >
> > > Example:
> > >
> > > type uffd_t;
> > > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]";
> > > allow sysadm_t uffd_t:anon_inode { create };
> > >
> > > (The next patch in this series is necessary for making userfaultfd
> > > support this new interface.  The example above is just
> > > for exposition.)
> > >
> > > Signed-off-by: Daniel Colascione 
> > > Signed-off-by: Lokesh Gidra 
> > > ---
> > >  security/selinux/hooks.c| 56 +
> > >  security/selinux/include/classmap.h |  2 ++
> > >  2 files changed, 58 insertions(+)
> > >
> > > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > > index 6b1826fc3658..d092aa512868 100644
> > > --- a/security/selinux/hooks.c
> > > +++ b/security/selinux/hooks.c
> > > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct 
> > > inode *inode, struct inode *dir,
> > > return 0;
> > >  }
> > >
> > > +static int selinux_inode_init_security_anon(struct inode *inode,
> > > +   const struct qstr *name,
> > > +   const struct inode 
> > > *context_inode)
> > > +{
> > > +   const struct task_security_struct *tsec = 
> > > selinux_cred(current_cred());
> > > +   struct common_audit_data ad;
> > > +   struct inode_security_struct *isec;
> > > +   int rc;
> > > +
> > > +   if (unlikely(!selinux_initialized(_state)))
> > > +   return 0;
> > > +
> > > +   isec = selinux_inode(inode);
> > > +
> > > +   /*
> > > +* We only get here once per ephemeral inode.  The inode has
> > > +* been initialized via inode_alloc_security but is otherwise
> > > +* untouched.
> > > +*/
> > > +
> > > +   if (context_inode) {
> > > +   struct inode_security_struct *context_isec =
> > > +   selinux_inode(context_inode);
> > > +   if (context_isec->initialized != LABEL_INITIALIZED)
> > > +   return -EACCES;
> > > +
> > > +   isec->sclass = context_isec->sclass;
> >
> > Taking the object class directly from the context_inode is
> > interesting, and I suspect problematic.  In the case below where no
> > context_inode is supplied the object class is set to
> > SECCLASS_ANON_INODE, which is correct, but when a context_inode is
> > supplied there is no guarantee that the object class will be set to
> > SECCLASS_ANON_INODE.  This could both pose a problem for policy
> > writers (how do you distinguish the anon inode from other normal file
> > inodes in this case?) as well as an outright fault later in this
> > function when we try to check the ANON_INODE__CREATE on an object
> > other than a SECCLASS_ANON_INODE object.
> >
> Thanks for catching this. I'll initialize 'sclass' unconditionally to
> SECCLASS_ANON_INODE in the next version. Also, do you think I should
> add a check that context_inode's sclass must be SECCLASS_ANON_INODE to
> confirm that we never receive a regular inode as context_inode?

This is one of the reasons why I was asking if you ever saw the need
to use a regular inode here.  It seems much safer to me to add a check
to ensure that context_inode is SECCLASS_ANON_INODE and return an
error otherwise; I would also suggest emitting an error using pr_err()
with something along the lines of &quo

[PATCH 4.19 7/7] proc: Use a list of inodes to flush from proc

2021-01-06 Thread Wen Yang

From: "Eric W. Biederman" 

[ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ]

Rework the flushing of proc to use a list of directory inodes that
need to be flushed.

The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.

This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.

This flushing remains an optimization.  The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died.  When unused the shrinker will
eventually be able to remove these dentries.

There is a case in de_thread where proc_flush_pid can be
called early for a given pid.  Which winds up being
safe (if suboptimal) as this is just an optiimization.

Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.

So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid.  Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place.  This removes a small race
where a dentry could recreated.

As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.

The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.

v2: Initialize pid->inodes.  I somehow failed to get that
initialization into the initial version of the patch.  A boot
failure was reported by "kernel test robot ", and
failure to initialize that pid->inodes matches all of the reported
symptoms.

Signed-off-by: Eric W. Biederman 
Fixes: f333c700c610 ("pidns: Add a limit on the number of pid
namespaces")
Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush
entries from multiple proc trees")
Cc:  # 4.19.x: b3e583825266: clone: add
CLONE_PIDFD
Cc:  # 4.19.x: b53b0b9d9a61: pidfd: add polling
support
Cc:  # 4.19.x: 0afa5ca82212: proc: Rename in
proc_inode rename sysctl_inodes sibling_inodes
Cc:  # 4.19.x: 26dbc60f385f: proc: Generalize
proc_sys_prune_dcache into proc_prune_siblings_dcache
Cc:  # 4.19.x: 71448011ea2a: proc: Clear the
pieces of
proc_inode that proc_evict_inode cares about
Cc:  # 4.19.x: f90f3cafe8d5: Use d_invalidate in
proc_prune_siblings_dcache
Cc:  # 4.19.x
Signed-off-by: Wen Yang 
---
 fs/proc/base.c  | 111 
 fs/proc/inode.c |   2 +-
 fs/proc/internal.h  |   1 +
 include/linux/pid.h |   1 +
 include/linux/proc_fs.h |   4 +-
 kernel/exit.c   |   4 +-
 kernel/pid.c|   1 +
 7 files changed, 45 insertions(+), 79 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5e705fa..ea74c7c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1741,11 +1741,25 @@ void task_dump_owner(struct task_struct *task, umode_t 
mode,
*rgid = gid;
 }
 
+void proc_pid_evict_inode(struct proc_inode *ei)
+{
+   struct pid *pid = ei->pid;
+
+   if (S_ISDIR(ei->vfs_inode.i_mode)) {
+   spin_lock(>wait_pidfd.lock);
+   hlist_del_init_rcu(>sibling_inodes);
+   spin_unlock(>wait_pidfd.lock);
+   }
+
+   put_pid(pid);
+}
+
 struct inode *proc_pid_make_inode(struct super_block * sb,
  struct task_struct *task, umode_t mode)
 {
struct inode * inode;
struct proc_inode *ei;
+   struct pid *pid;
 
/* We need a new inode */
 
@@ -1763,10 +1777,18 @@ struct inode *proc_pid_make_inode(struct super_block * 
sb,
/*
 * grab the reference to task.
 */
-   ei->pid = get_task_pid(task, PIDTYPE_PID);
-   if (!ei->pid)
+   pid = get_task_pid(task, PIDTYPE_PID);
+   if (!pid)
goto out_unlock;
 
+   /* Let the pid remember us for quick removal */
+   ei->pid = pid;
+   if (S_ISDIR(mode)) {
+   spin_lock(>wait_pidfd.lock);
+   hlist_add_head_rcu(>sibling_inodes, >inodes);
+   spin_unlock(>wait_pidfd.lock);
+   }
+
task_dump_owner(task, 0, >i_uid, >i_gid);
security_task_to_inode(task, inode);
 
@@ -3067,90 +3089,29 @@ static struct dentry *proc_tgid_base_lookup(struct 
inode *dir, struct dentry *de
.permission = proc_pid_permis

[PATCH v2 4.9 10/10] proc: Use a list of inodes to flush from proc

2021-01-06 Thread Wen Yang

From: "Eric W. Biederman" 

[ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ]

Rework the flushing of proc to use a list of directory inodes that
need to be flushed.

The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.

This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.

This flushing remains an optimization.  The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died.  When unused the shrinker will
eventually be able to remove these dentries.

There is a case in de_thread where proc_flush_pid can be
called early for a given pid.  Which winds up being
safe (if suboptimal) as this is just an optiimization.

Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.

So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid.  Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place.  This removes a small race
where a dentry could recreated.

As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.

The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.

v2: Initialize pid->inodes.  I somehow failed to get that
initialization into the initial version of the patch.  A boot
failure was reported by "kernel test robot ", and
failure to initialize that pid->inodes matches all of the reported
symptoms.

Signed-off-by: Eric W. Biederman 
Fixes: f333c700c610 ("pidns: Add a limit on the number of pid
namespaces")
Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush
entries from multiple proc trees")
Cc:  # 4.9.x: b3e583825266: clone: add CLONE_PIDFD
Cc:  # 4.9.x: b53b0b9d9a61: pidfd: add polling
support
Cc:  # 4.9.x: db978da8fa1d: proc: Pass file mode to
proc_pid_make_inode
Cc:  # 4.9.x: 68eb94f16227: proc: Better ownership of
files for non-dumpable tasks in user namespaces
Cc:  # 4.9.x: e3912ac37e07: proc: use %u for pid
printing and slightly less stack
Cc:  # 4.9.x: 0afa5ca82212: proc: Rename in
proc_inode rename sysctl_inodes sibling_inodes
Cc:  # 4.9.x: 26dbc60f385f: proc: Generalize
proc_sys_prune_dcache into proc_prune_siblings_dcache
Cc:  # 4.9.x: 71448011ea2a: proc: Clear the pieces of
proc_inode that proc_evict_inode cares about
Cc:  # 4.9.x: f90f3cafe8d5: Use d_invalidate in
proc_prune_siblings_dcache
Cc:  # 4.9.x
(proc: fix up cherry-pick conflicts for 7bc3e6e55acf)
Signed-off-by: Wen Yang 
---
 fs/proc/base.c  | 111 
 fs/proc/inode.c |   2 +-
 fs/proc/internal.h  |   1 +
 include/linux/pid.h |   1 +
 include/linux/proc_fs.h |   4 +-
 kernel/exit.c   |   5 ++-
 kernel/pid.c|   1 +
 7 files changed, 45 insertions(+), 80 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3502a40..11caf35 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1728,11 +1728,25 @@ void task_dump_owner(struct task_struct *task, mode_t 
mode,
*rgid = gid;
 }
 
+void proc_pid_evict_inode(struct proc_inode *ei)
+{
+   struct pid *pid = ei->pid;
+
+   if (S_ISDIR(ei->vfs_inode.i_mode)) {
+   spin_lock(>wait_pidfd.lock);
+   hlist_del_init_rcu(>sibling_inodes);
+   spin_unlock(>wait_pidfd.lock);
+   }
+
+   put_pid(pid);
+}
+
 struct inode *proc_pid_make_inode(struct super_block * sb,
  struct task_struct *task, umode_t mode)
 {
struct inode * inode;
struct proc_inode *ei;
+   struct pid *pid;
 
/* We need a new inode */
 
@@ -1750,10 +1764,18 @@ struct inode *proc_pid_make_inode(struct super_block * 
sb,
/*
 * grab the reference to task.
 */
-   ei->pid = get_task_pid(task, PIDTYPE_PID);
-   if (!ei->pid)
+   pid = get_task_pid(task, PIDTYPE_PID);
+   if (!pid)
goto out_unlock;
 
+   /* Let the pid remember us for quick removal */
+   ei->pid = pid;
+   if (S_ISDIR(mode)) {
+   spin_lock(>wait_pidfd.lock);
+   hlist_add_head_rcu(>sibling_inodes, >inodes);
+   spi

Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes

2021-01-06 Thread Lokesh Gidra

On Wed, Jan 6, 2021 at 7:03 PM Paul Moore  wrote:
>
> On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra  wrote:
> > From: Daniel Colascione 
> >
> > This change uses the anon_inodes and LSM infrastructure introduced in
> > the previous patches to give SELinux the ability to control
> > anonymous-inode files that are created using the new
> > anon_inode_getfd_secure() function.
> >
> > A SELinux policy author detects and controls these anonymous inodes by
> > adding a name-based type_transition rule that assigns a new security
> > type to anonymous-inode files created in some domain. The name used
> > for the name-based transition is the name associated with the
> > anonymous inode for file listings --- e.g., "[userfaultfd]" or
> > "[perf_event]".
> >
> > Example:
> >
> > type uffd_t;
> > type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]";
> > allow sysadm_t uffd_t:anon_inode { create };
> >
> > (The next patch in this series is necessary for making userfaultfd
> > support this new interface.  The example above is just
> > for exposition.)
> >
> > Signed-off-by: Daniel Colascione 
> > Signed-off-by: Lokesh Gidra 
> > ---
> >  security/selinux/hooks.c| 56 +
> >  security/selinux/include/classmap.h |  2 ++
> >  2 files changed, 58 insertions(+)
> >
> > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > index 6b1826fc3658..d092aa512868 100644
> > --- a/security/selinux/hooks.c
> > +++ b/security/selinux/hooks.c
> > @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct inode 
> > *inode, struct inode *dir,
> > return 0;
> >  }
> >
> > +static int selinux_inode_init_security_anon(struct inode *inode,
> > +   const struct qstr *name,
> > +   const struct inode 
> > *context_inode)
> > +{
> > +   const struct task_security_struct *tsec = 
> > selinux_cred(current_cred());
> > +   struct common_audit_data ad;
> > +   struct inode_security_struct *isec;
> > +   int rc;
> > +
> > +   if (unlikely(!selinux_initialized(_state)))
> > +   return 0;
> > +
> > +   isec = selinux_inode(inode);
> > +
> > +   /*
> > +* We only get here once per ephemeral inode.  The inode has
> > +* been initialized via inode_alloc_security but is otherwise
> > +* untouched.
> > +*/
> > +
> > +   if (context_inode) {
> > +   struct inode_security_struct *context_isec =
> > +   selinux_inode(context_inode);
> > +   if (context_isec->initialized != LABEL_INITIALIZED)
> > +   return -EACCES;
> > +
> > +   isec->sclass = context_isec->sclass;
>
> Taking the object class directly from the context_inode is
> interesting, and I suspect problematic.  In the case below where no
> context_inode is supplied the object class is set to
> SECCLASS_ANON_INODE, which is correct, but when a context_inode is
> supplied there is no guarantee that the object class will be set to
> SECCLASS_ANON_INODE.  This could both pose a problem for policy
> writers (how do you distinguish the anon inode from other normal file
> inodes in this case?) as well as an outright fault later in this
> function when we try to check the ANON_INODE__CREATE on an object
> other than a SECCLASS_ANON_INODE object.
>
Thanks for catching this. I'll initialize 'sclass' unconditionally to
SECCLASS_ANON_INODE in the next version. Also, do you think I should
add a check that context_inode's sclass must be SECCLASS_ANON_INODE to
confirm that we never receive a regular inode as context_inode?

> It works in the userfaultfd case because the context_inode is
> originally created with this function so the object class is correctly
> set to SECCLASS_ANON_INODE, but can we always guarantee that to be the
> case?  Do we ever need or want to support using a context_inode that
> is not SECCLASS_ANON_INODE?
>

I don't think there is any requirement of supporting context_inode
which isn't anon-inode. And even if there is, as you described
earlier, for ANON_INODE__CREATE to work the sclass has to be
SECCLASS_ANON_INODE. I'll appreciate comments on this from others,
particularly Daniel and Stephen who originally discussed and
implemented this patch.


> > +   isec->sid = context_isec->sid;
> > +   } else {
> > +   isec-&

Re: [PATCH v13 3/4] selinux: teach SELinux about anonymous inodes

2021-01-06 Thread Paul Moore

On Wed, Nov 11, 2020 at 8:54 PM Lokesh Gidra  wrote:
> From: Daniel Colascione 
>
> This change uses the anon_inodes and LSM infrastructure introduced in
> the previous patches to give SELinux the ability to control
> anonymous-inode files that are created using the new
> anon_inode_getfd_secure() function.
>
> A SELinux policy author detects and controls these anonymous inodes by
> adding a name-based type_transition rule that assigns a new security
> type to anonymous-inode files created in some domain. The name used
> for the name-based transition is the name associated with the
> anonymous inode for file listings --- e.g., "[userfaultfd]" or
> "[perf_event]".
>
> Example:
>
> type uffd_t;
> type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]";
> allow sysadm_t uffd_t:anon_inode { create };
>
> (The next patch in this series is necessary for making userfaultfd
> support this new interface.  The example above is just
> for exposition.)
>
> Signed-off-by: Daniel Colascione 
> Signed-off-by: Lokesh Gidra 
> ---
>  security/selinux/hooks.c| 56 +
>  security/selinux/include/classmap.h |  2 ++
>  2 files changed, 58 insertions(+)
>
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 6b1826fc3658..d092aa512868 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -2927,6 +2927,61 @@ static int selinux_inode_init_security(struct inode 
> *inode, struct inode *dir,
> return 0;
>  }
>
> +static int selinux_inode_init_security_anon(struct inode *inode,
> +   const struct qstr *name,
> +   const struct inode *context_inode)
> +{
> +   const struct task_security_struct *tsec = 
> selinux_cred(current_cred());
> +   struct common_audit_data ad;
> +   struct inode_security_struct *isec;
> +   int rc;
> +
> +   if (unlikely(!selinux_initialized(_state)))
> +   return 0;
> +
> +   isec = selinux_inode(inode);
> +
> +   /*
> +* We only get here once per ephemeral inode.  The inode has
> +* been initialized via inode_alloc_security but is otherwise
> +* untouched.
> +*/
> +
> +   if (context_inode) {
> +   struct inode_security_struct *context_isec =
> +   selinux_inode(context_inode);
> +   if (context_isec->initialized != LABEL_INITIALIZED)
> +   return -EACCES;
> +
> +   isec->sclass = context_isec->sclass;

Taking the object class directly from the context_inode is
interesting, and I suspect problematic.  In the case below where no
context_inode is supplied the object class is set to
SECCLASS_ANON_INODE, which is correct, but when a context_inode is
supplied there is no guarantee that the object class will be set to
SECCLASS_ANON_INODE.  This could both pose a problem for policy
writers (how do you distinguish the anon inode from other normal file
inodes in this case?) as well as an outright fault later in this
function when we try to check the ANON_INODE__CREATE on an object
other than a SECCLASS_ANON_INODE object.

It works in the userfaultfd case because the context_inode is
originally created with this function so the object class is correctly
set to SECCLASS_ANON_INODE, but can we always guarantee that to be the
case?  Do we ever need or want to support using a context_inode that
is not SECCLASS_ANON_INODE?

> +   isec->sid = context_isec->sid;
> +   } else {
> +   isec->sclass = SECCLASS_ANON_INODE;
> +   rc = security_transition_sid(
> +   _state, tsec->sid, tsec->sid,
> +   isec->sclass, name, >sid);
> +   if (rc)
> +   return rc;
> +   }
> +
> +   isec->initialized = LABEL_INITIALIZED;
> +
> +   /*
> +* Now that we've initialized security, check whether we're
> +* allowed to actually create this type of anonymous inode.
> +*/
> +
> +   ad.type = LSM_AUDIT_DATA_INODE;
> +   ad.u.inode = inode;
> +
> +   return avc_has_perm(_state,
> +   tsec->sid,
> +   isec->sid,
> +   isec->sclass,
> +   ANON_INODE__CREATE,
> +   );
> +}

-- 
paul moore
www.paul-moore.com

[PATCH 4.4 117/132] Btrfs: fix selftests failure due to uninitialized i_mode in test inodes

2020-12-28 Thread Greg Kroah-Hartman

From: Filipe Manana 

commit 9f7fec0ba89108b9385f1b9fb167861224912a4a upstream

Some of the self tests create a test inode, setup some extents and then do
calls to btrfs_get_extent() to test that the corresponding extent maps
exist and are correct. However btrfs_get_extent(), since the 5.2 merge
window, now errors out when it finds a regular or prealloc extent for an
inode that does not correspond to a regular file (its ->i_mode is not
S_IFREG). This causes the self tests to fail sometimes, specially when
KASAN, slub_debug and page poisoning are enabled:

  $ modprobe btrfs
  modprobe: ERROR: could not insert 'btrfs': Invalid argument

  $ dmesg
  [ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, 
integrity-checker=on, ref-verify=on
  [ 9414.692655] BTRFS: selftest: sectorsize: 4096  nodesize: 4096
  [ 9414.692658] BTRFS: selftest: running btrfs free space cache tests
  [ 9414.692918] BTRFS: selftest: running extent only tests
  [ 9414.693061] BTRFS: selftest: running bitmap only tests
  [ 9414.693366] BTRFS: selftest: running bitmap and extent tests
  [ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent 
tests
  [ 9414.697131] BTRFS: selftest: running extent buffer operation tests
  [ 9414.697133] BTRFS: selftest: running btrfs_split_item tests
  [ 9414.697564] BTRFS: selftest: running extent I/O tests
  [ 9414.697583] BTRFS: selftest: running find delalloc tests
  [ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test
  [ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests
  [ 9415.124192] BTRFS: selftest: running inode tests
  [ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests
  [ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test
  [ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent 
found for non-regular inode 256
  [ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a 
real extent, got 0

This happens because the test inodes are created without ever initializing
the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs
callback btrfs_alloc_inode() initialize the i_mode. Initialization of the
i_mode is done through the various callbacks used by the VFS to create
new inodes (regular files, directories, symlinks, tmpfiles, etc), which
all call btrfs_new_inode() which in turn calls inode_init_owner(), which
sets the inode's i_mode. Since the tests only uses new_inode() to create
the test inodes, the i_mode was never initialized.

This always happens on a VM I used with kasan, slub_debug and many other
debug facilities enabled. It also happened to someone who reported this
on bugzilla (on a 5.3-rc).

Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode().

Fixes: 6bf9e4bd6a2778 ("btrfs: inode: Verify inode mode to avoid NULL pointer 
dereference")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397
Signed-off-by: Filipe Manana 
Reviewed-by: Qu Wenruo 
Signed-off-by: David Sterba 
Signed-off-by: Sudip Mukherjee 
Signed-off-by: Greg Kroah-Hartman 
---
 fs/btrfs/tests/btrfs-tests.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -48,7 +48,13 @@ static struct file_system_type test_type
 
 struct inode *btrfs_new_test_inode(void)
 {
-   return new_inode(test_mnt->mnt_sb);
+   struct inode *inode;
+
+   inode = new_inode(test_mnt->mnt_sb);
+   if (inode)
+   inode_init_owner(inode, NULL, S_IFREG);
+
+   return inode;
 }
 
 int btrfs_init_test_fs(void)

[PATCH 4.9 153/175] Btrfs: fix selftests failure due to uninitialized i_mode in test inodes

2020-12-28 Thread Greg Kroah-Hartman

From: Filipe Manana 

commit 9f7fec0ba89108b9385f1b9fb167861224912a4a upstream

Some of the self tests create a test inode, setup some extents and then do
calls to btrfs_get_extent() to test that the corresponding extent maps
exist and are correct. However btrfs_get_extent(), since the 5.2 merge
window, now errors out when it finds a regular or prealloc extent for an
inode that does not correspond to a regular file (its ->i_mode is not
S_IFREG). This causes the self tests to fail sometimes, specially when
KASAN, slub_debug and page poisoning are enabled:

  $ modprobe btrfs
  modprobe: ERROR: could not insert 'btrfs': Invalid argument

  $ dmesg
  [ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, 
integrity-checker=on, ref-verify=on
  [ 9414.692655] BTRFS: selftest: sectorsize: 4096  nodesize: 4096
  [ 9414.692658] BTRFS: selftest: running btrfs free space cache tests
  [ 9414.692918] BTRFS: selftest: running extent only tests
  [ 9414.693061] BTRFS: selftest: running bitmap only tests
  [ 9414.693366] BTRFS: selftest: running bitmap and extent tests
  [ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent 
tests
  [ 9414.697131] BTRFS: selftest: running extent buffer operation tests
  [ 9414.697133] BTRFS: selftest: running btrfs_split_item tests
  [ 9414.697564] BTRFS: selftest: running extent I/O tests
  [ 9414.697583] BTRFS: selftest: running find delalloc tests
  [ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test
  [ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests
  [ 9415.124192] BTRFS: selftest: running inode tests
  [ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests
  [ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test
  [ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent 
found for non-regular inode 256
  [ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a 
real extent, got 0

This happens because the test inodes are created without ever initializing
the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs
callback btrfs_alloc_inode() initialize the i_mode. Initialization of the
i_mode is done through the various callbacks used by the VFS to create
new inodes (regular files, directories, symlinks, tmpfiles, etc), which
all call btrfs_new_inode() which in turn calls inode_init_owner(), which
sets the inode's i_mode. Since the tests only uses new_inode() to create
the test inodes, the i_mode was never initialized.

This always happens on a VM I used with kasan, slub_debug and many other
debug facilities enabled. It also happened to someone who reported this
on bugzilla (on a 5.3-rc).

Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode().

Fixes: 6bf9e4bd6a2778 ("btrfs: inode: Verify inode mode to avoid NULL pointer 
dereference")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397
Signed-off-by: Filipe Manana 
Reviewed-by: Qu Wenruo 
Signed-off-by: David Sterba 
Signed-off-by: Sudip Mukherjee 
Signed-off-by: Greg Kroah-Hartman 
---
 fs/btrfs/tests/btrfs-tests.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -51,7 +51,13 @@ static struct file_system_type test_type
 
 struct inode *btrfs_new_test_inode(void)
 {
-   return new_inode(test_mnt->mnt_sb);
+   struct inode *inode;
+
+   inode = new_inode(test_mnt->mnt_sb);
+   if (inode)
+   inode_init_owner(inode, NULL, S_IFREG);
+
+   return inode;
 }
 
 static int btrfs_init_test_fs(void)

[PATCH 4.14 206/242] Btrfs: fix selftests failure due to uninitialized i_mode in test inodes

2020-12-28 Thread Greg Kroah-Hartman

From: Filipe Manana 

commit 9f7fec0ba89108b9385f1b9fb167861224912a4a upstream

Some of the self tests create a test inode, setup some extents and then do
calls to btrfs_get_extent() to test that the corresponding extent maps
exist and are correct. However btrfs_get_extent(), since the 5.2 merge
window, now errors out when it finds a regular or prealloc extent for an
inode that does not correspond to a regular file (its ->i_mode is not
S_IFREG). This causes the self tests to fail sometimes, specially when
KASAN, slub_debug and page poisoning are enabled:

  $ modprobe btrfs
  modprobe: ERROR: could not insert 'btrfs': Invalid argument

  $ dmesg
  [ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, 
integrity-checker=on, ref-verify=on
  [ 9414.692655] BTRFS: selftest: sectorsize: 4096  nodesize: 4096
  [ 9414.692658] BTRFS: selftest: running btrfs free space cache tests
  [ 9414.692918] BTRFS: selftest: running extent only tests
  [ 9414.693061] BTRFS: selftest: running bitmap only tests
  [ 9414.693366] BTRFS: selftest: running bitmap and extent tests
  [ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent 
tests
  [ 9414.697131] BTRFS: selftest: running extent buffer operation tests
  [ 9414.697133] BTRFS: selftest: running btrfs_split_item tests
  [ 9414.697564] BTRFS: selftest: running extent I/O tests
  [ 9414.697583] BTRFS: selftest: running find delalloc tests
  [ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test
  [ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests
  [ 9415.124192] BTRFS: selftest: running inode tests
  [ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests
  [ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test
  [ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent 
found for non-regular inode 256
  [ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a 
real extent, got 0

This happens because the test inodes are created without ever initializing
the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs
callback btrfs_alloc_inode() initialize the i_mode. Initialization of the
i_mode is done through the various callbacks used by the VFS to create
new inodes (regular files, directories, symlinks, tmpfiles, etc), which
all call btrfs_new_inode() which in turn calls inode_init_owner(), which
sets the inode's i_mode. Since the tests only uses new_inode() to create
the test inodes, the i_mode was never initialized.

This always happens on a VM I used with kasan, slub_debug and many other
debug facilities enabled. It also happened to someone who reported this
on bugzilla (on a 5.3-rc).

Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode().

Fixes: 6bf9e4bd6a2778 ("btrfs: inode: Verify inode mode to avoid NULL pointer 
dereference")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397
Signed-off-by: Filipe Manana 
Reviewed-by: Qu Wenruo 
Signed-off-by: David Sterba 
Signed-off-by: Sudip Mukherjee 
Signed-off-by: Greg Kroah-Hartman 
---
 fs/btrfs/tests/btrfs-tests.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -51,7 +51,13 @@ static struct file_system_type test_type
 
 struct inode *btrfs_new_test_inode(void)
 {
-   return new_inode(test_mnt->mnt_sb);
+   struct inode *inode;
+
+   inode = new_inode(test_mnt->mnt_sb);
+   if (inode)
+   inode_init_owner(inode, NULL, S_IFREG);
+
+   return inode;
 }
 
 static int btrfs_init_test_fs(void)

[PATCH 4.19 298/346] ext4: fix deadlock with fs freezing and EA inodes

2020-12-28 Thread Greg Kroah-Hartman

From: Jan Kara 

commit 46e294efc355c48d1dd4d58501aa56dac461792a upstream.

Xattr code using inodes with large xattr data can end up dropping last
inode reference (and thus deleting the inode) from places like
ext4_xattr_set_entry(). That function is called with transaction started
and so ext4_evict_inode() can deadlock against fs freezing like:

CPU1CPU2

removexattr()   freeze_super()
  vfs_removexattr()
ext4_xattr_set()
  handle = ext4_journal_start()
  ...
  ext4_xattr_set_entry()
iput(old_ea_inode)
  ext4_evict_inode(old_ea_inode)
  sb->s_writers.frozen = SB_FREEZE_FS;
  sb_wait_write(sb, SB_FREEZE_FS);
  ext4_freeze()
jbd2_journal_lock_updates()
  -> blocks waiting for all
 handles to stop
sb_start_intwrite()
  -> blocks as sb is already in SB_FREEZE_FS state

Generally it is advisable to delete inodes from a separate transaction
as it can consume quite some credits however in this case it would be
quite clumsy and furthermore the credits for inode deletion are quite
limited and already accounted for. So just tweak ext4_evict_inode() to
avoid freeze protection if we have transaction already started and thus
it is not really needed anyway.

Cc: sta...@vger.kernel.org
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Signed-off-by: Jan Kara 
Reviewed-by: Andreas Dilger 
Link: https://lore.kernel.org/r/20201127110649.24730-1-j...@suse.cz
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman 

---
 fs/ext4/inode.c |   21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -203,6 +203,7 @@ void ext4_evict_inode(struct inode *inod
 */
int extra_credits = 6;
struct ext4_xattr_inode_array *ea_inode_array = NULL;
+   bool freeze_protected = false;
 
trace_ext4_evict_inode(inode);
 
@@ -250,9 +251,14 @@ void ext4_evict_inode(struct inode *inod
 
/*
 * Protect us against freezing - iput() caller didn't have to have any
-* protection against it
-*/
-   sb_start_intwrite(inode->i_sb);
+* protection against it. When we are in a running transaction though,
+* we are already protected against freezing and we cannot grab further
+* protection due to lock ordering constraints.
+*/
+   if (!ext4_journal_current_handle()) {
+   sb_start_intwrite(inode->i_sb);
+   freeze_protected = true;
+   }
 
if (!IS_NOQUOTA(inode))
extra_credits += EXT4_MAXQUOTAS_DEL_BLOCKS(inode->i_sb);
@@ -271,7 +277,8 @@ void ext4_evict_inode(struct inode *inod
 * cleaned up.
 */
ext4_orphan_del(NULL, inode);
-   sb_end_intwrite(inode->i_sb);
+   if (freeze_protected)
+   sb_end_intwrite(inode->i_sb);
goto no_delete;
}
 
@@ -312,7 +319,8 @@ void ext4_evict_inode(struct inode *inod
 stop_handle:
ext4_journal_stop(handle);
ext4_orphan_del(NULL, inode);
-   sb_end_intwrite(inode->i_sb);
+   if (freeze_protected)
+   sb_end_intwrite(inode->i_sb);
ext4_xattr_inode_array_free(ea_inode_array);
goto no_delete;
}
@@ -341,7 +349,8 @@ stop_handle:
else
ext4_free_inode(handle, inode);
ext4_journal_stop(handle);
-   sb_end_intwrite(inode->i_sb);
+   if (freeze_protected)
+   sb_end_intwrite(inode->i_sb);
ext4_xattr_inode_array_free(ea_inode_array);
return;
 no_delete:

[PATCH 5.4 375/453] ext4: fix deadlock with fs freezing and EA inodes

2020-12-28 Thread Greg Kroah-Hartman

From: Jan Kara 

commit 46e294efc355c48d1dd4d58501aa56dac461792a upstream.

Xattr code using inodes with large xattr data can end up dropping last
inode reference (and thus deleting the inode) from places like
ext4_xattr_set_entry(). That function is called with transaction started
and so ext4_evict_inode() can deadlock against fs freezing like:

CPU1CPU2

removexattr()   freeze_super()
  vfs_removexattr()
ext4_xattr_set()
  handle = ext4_journal_start()
  ...
  ext4_xattr_set_entry()
iput(old_ea_inode)
  ext4_evict_inode(old_ea_inode)
  sb->s_writers.frozen = SB_FREEZE_FS;
  sb_wait_write(sb, SB_FREEZE_FS);
  ext4_freeze()
jbd2_journal_lock_updates()
  -> blocks waiting for all
 handles to stop
sb_start_intwrite()
  -> blocks as sb is already in SB_FREEZE_FS state

Generally it is advisable to delete inodes from a separate transaction
as it can consume quite some credits however in this case it would be
quite clumsy and furthermore the credits for inode deletion are quite
limited and already accounted for. So just tweak ext4_evict_inode() to
avoid freeze protection if we have transaction already started and thus
it is not really needed anyway.

Cc: sta...@vger.kernel.org
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Signed-off-by: Jan Kara 
Reviewed-by: Andreas Dilger 
Link: https://lore.kernel.org/r/20201127110649.24730-1-j...@suse.cz
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman 

---
 fs/ext4/inode.c |   21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -203,6 +203,7 @@ void ext4_evict_inode(struct inode *inod
 */
int extra_credits = 6;
struct ext4_xattr_inode_array *ea_inode_array = NULL;
+   bool freeze_protected = false;
 
trace_ext4_evict_inode(inode);
 
@@ -250,9 +251,14 @@ void ext4_evict_inode(struct inode *inod
 
/*
 * Protect us against freezing - iput() caller didn't have to have any
-* protection against it
-*/
-   sb_start_intwrite(inode->i_sb);
+* protection against it. When we are in a running transaction though,
+* we are already protected against freezing and we cannot grab further
+* protection due to lock ordering constraints.
+*/
+   if (!ext4_journal_current_handle()) {
+   sb_start_intwrite(inode->i_sb);
+   freeze_protected = true;
+   }
 
if (!IS_NOQUOTA(inode))
extra_credits += EXT4_MAXQUOTAS_DEL_BLOCKS(inode->i_sb);
@@ -271,7 +277,8 @@ void ext4_evict_inode(struct inode *inod
 * cleaned up.
 */
ext4_orphan_del(NULL, inode);
-   sb_end_intwrite(inode->i_sb);
+   if (freeze_protected)
+   sb_end_intwrite(inode->i_sb);
goto no_delete;
}
 
@@ -312,7 +319,8 @@ void ext4_evict_inode(struct inode *inod
 stop_handle:
ext4_journal_stop(handle);
ext4_orphan_del(NULL, inode);
-   sb_end_intwrite(inode->i_sb);
+   if (freeze_protected)
+   sb_end_intwrite(inode->i_sb);
ext4_xattr_inode_array_free(ea_inode_array);
goto no_delete;
}
@@ -341,7 +349,8 @@ stop_handle:
else
ext4_free_inode(handle, inode);
ext4_journal_stop(handle);
-   sb_end_intwrite(inode->i_sb);
+   if (freeze_protected)
+   sb_end_intwrite(inode->i_sb);
ext4_xattr_inode_array_free(ea_inode_array);
return;
 no_delete:

[PATCH 5.10 592/717] ext4: fix deadlock with fs freezing and EA inodes

2020-12-28 Thread Greg Kroah-Hartman

From: Jan Kara 

commit 46e294efc355c48d1dd4d58501aa56dac461792a upstream.

Xattr code using inodes with large xattr data can end up dropping last
inode reference (and thus deleting the inode) from places like
ext4_xattr_set_entry(). That function is called with transaction started
and so ext4_evict_inode() can deadlock against fs freezing like:

CPU1CPU2

removexattr()   freeze_super()
  vfs_removexattr()
ext4_xattr_set()
  handle = ext4_journal_start()
  ...
  ext4_xattr_set_entry()
iput(old_ea_inode)
  ext4_evict_inode(old_ea_inode)
  sb->s_writers.frozen = SB_FREEZE_FS;
  sb_wait_write(sb, SB_FREEZE_FS);
  ext4_freeze()
jbd2_journal_lock_updates()
  -> blocks waiting for all
 handles to stop
sb_start_intwrite()
  -> blocks as sb is already in SB_FREEZE_FS state

Generally it is advisable to delete inodes from a separate transaction
as it can consume quite some credits however in this case it would be
quite clumsy and furthermore the credits for inode deletion are quite
limited and already accounted for. So just tweak ext4_evict_inode() to
avoid freeze protection if we have transaction already started and thus
it is not really needed anyway.

Cc: sta...@vger.kernel.org
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Signed-off-by: Jan Kara 
Reviewed-by: Andreas Dilger 
Link: https://lore.kernel.org/r/20201127110649.24730-1-j...@suse.cz
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman 

---
 fs/ext4/inode.c |   21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -175,6 +175,7 @@ void ext4_evict_inode(struct inode *inod
 */
int extra_credits = 6;
struct ext4_xattr_inode_array *ea_inode_array = NULL;
+   bool freeze_protected = false;
 
trace_ext4_evict_inode(inode);
 
@@ -232,9 +233,14 @@ void ext4_evict_inode(struct inode *inod
 
/*
 * Protect us against freezing - iput() caller didn't have to have any
-* protection against it
-*/
-   sb_start_intwrite(inode->i_sb);
+* protection against it. When we are in a running transaction though,
+* we are already protected against freezing and we cannot grab further
+* protection due to lock ordering constraints.
+*/
+   if (!ext4_journal_current_handle()) {
+   sb_start_intwrite(inode->i_sb);
+   freeze_protected = true;
+   }
 
if (!IS_NOQUOTA(inode))
extra_credits += EXT4_MAXQUOTAS_DEL_BLOCKS(inode->i_sb);
@@ -253,7 +259,8 @@ void ext4_evict_inode(struct inode *inod
 * cleaned up.
 */
ext4_orphan_del(NULL, inode);
-   sb_end_intwrite(inode->i_sb);
+   if (freeze_protected)
+   sb_end_intwrite(inode->i_sb);
goto no_delete;
}
 
@@ -294,7 +301,8 @@ void ext4_evict_inode(struct inode *inod
 stop_handle:
ext4_journal_stop(handle);
ext4_orphan_del(NULL, inode);
-   sb_end_intwrite(inode->i_sb);
+   if (freeze_protected)
+   sb_end_intwrite(inode->i_sb);
ext4_xattr_inode_array_free(ea_inode_array);
goto no_delete;
}
@@ -323,7 +331,8 @@ stop_handle:
else
ext4_free_inode(handle, inode);
ext4_journal_stop(handle);
-   sb_end_intwrite(inode->i_sb);
+   if (freeze_protected)
+   sb_end_intwrite(inode->i_sb);
ext4_xattr_inode_array_free(ea_inode_array);
return;
 no_delete:

[PATCH 4.14 209/242] ext4: fix deadlock with fs freezing and EA inodes

2020-12-28 Thread Greg Kroah-Hartman

From: Jan Kara 

commit 46e294efc355c48d1dd4d58501aa56dac461792a upstream.

Xattr code using inodes with large xattr data can end up dropping last
inode reference (and thus deleting the inode) from places like
ext4_xattr_set_entry(). That function is called with transaction started
and so ext4_evict_inode() can deadlock against fs freezing like:

CPU1CPU2

removexattr()   freeze_super()
  vfs_removexattr()
ext4_xattr_set()
  handle = ext4_journal_start()
  ...
  ext4_xattr_set_entry()
iput(old_ea_inode)
  ext4_evict_inode(old_ea_inode)
  sb->s_writers.frozen = SB_FREEZE_FS;
  sb_wait_write(sb, SB_FREEZE_FS);
  ext4_freeze()
jbd2_journal_lock_updates()
  -> blocks waiting for all
 handles to stop
sb_start_intwrite()
  -> blocks as sb is already in SB_FREEZE_FS state

Generally it is advisable to delete inodes from a separate transaction
as it can consume quite some credits however in this case it would be
quite clumsy and furthermore the credits for inode deletion are quite
limited and already accounted for. So just tweak ext4_evict_inode() to
avoid freeze protection if we have transaction already started and thus
it is not really needed anyway.

Cc: sta...@vger.kernel.org
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Signed-off-by: Jan Kara 
Reviewed-by: Andreas Dilger 
Link: https://lore.kernel.org/r/20201127110649.24730-1-j...@suse.cz
Signed-off-by: Theodore Ts'o 
Signed-off-by: Greg Kroah-Hartman 

---
 fs/ext4/inode.c |   21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -202,6 +202,7 @@ void ext4_evict_inode(struct inode *inod
 */
int extra_credits = 6;
struct ext4_xattr_inode_array *ea_inode_array = NULL;
+   bool freeze_protected = false;
 
trace_ext4_evict_inode(inode);
 
@@ -249,9 +250,14 @@ void ext4_evict_inode(struct inode *inod
 
/*
 * Protect us against freezing - iput() caller didn't have to have any
-* protection against it
-*/
-   sb_start_intwrite(inode->i_sb);
+* protection against it. When we are in a running transaction though,
+* we are already protected against freezing and we cannot grab further
+* protection due to lock ordering constraints.
+*/
+   if (!ext4_journal_current_handle()) {
+   sb_start_intwrite(inode->i_sb);
+   freeze_protected = true;
+   }
 
if (!IS_NOQUOTA(inode))
extra_credits += EXT4_MAXQUOTAS_DEL_BLOCKS(inode->i_sb);
@@ -270,7 +276,8 @@ void ext4_evict_inode(struct inode *inod
 * cleaned up.
 */
ext4_orphan_del(NULL, inode);
-   sb_end_intwrite(inode->i_sb);
+   if (freeze_protected)
+   sb_end_intwrite(inode->i_sb);
goto no_delete;
}
 
@@ -311,7 +318,8 @@ void ext4_evict_inode(struct inode *inod
 stop_handle:
ext4_journal_stop(handle);
ext4_orphan_del(NULL, inode);
-   sb_end_intwrite(inode->i_sb);
+   if (freeze_protected)
+   sb_end_intwrite(inode->i_sb);
ext4_xattr_inode_array_free(ea_inode_array);
goto no_delete;
}
@@ -340,7 +348,8 @@ stop_handle:
else
ext4_free_inode(handle, inode);
ext4_journal_stop(handle);
-   sb_end_intwrite(inode->i_sb);
+   if (freeze_protected)
+   sb_end_intwrite(inode->i_sb);
ext4_xattr_inode_array_free(ea_inode_array);
return;
 no_delete:

[PATCH AUTOSEL 5.4 101/130] btrfs: fix race that causes unnecessary logging of ancestor inodes

2020-12-22 Thread Sasha Levin

From: Filipe Manana 

[ Upstream commit 4d6221d7d83141d58ece6560e9cfd4cc92eab044 ]

When logging an inode and we are checking if we need to log ancestors that
are new, if the previous transaction is still committing we have a time
window where we can unnecessarily log ancestor inodes that were created in
the previous transaction.

The race is described by the following steps:

1) We are at transaction 1000;

2) Directory inode X is created, its generation is set to 1000;

3) The commit for transaction 1000 is started by task A;

4) The task committing transaction 1000 sets the transaction state to
   unblocked, writes the dirty extent buffers and the super blocks, then
   unlocks tree_log_mutex;

5) Inode Y, a regular file, is created under directory inode X, this
   results in starting a new transaction with a generation of 1001;

6) The transaction 1000 commit is unpinning extents. At this point
   fs_info->last_trans_committed still has a value of 999;

7) Task B calls fsync on inode Y and gets a handle for transaction 1001;

8) Task B ends up at log_all_new_ancestors() and then because inode Y has
   only one hard link, ends up at log_new_ancestors_fast(). There it reads
   a value of 999 from fs_info->last_trans_committed, and sees that the
   parent inode X has a generation of 1000, so we end up logging inode X:

 if (inode->generation > fs_info->last_trans_committed) {
 ret = btrfs_log_inode(trans, root, inode,
   LOG_INODE_EXISTS, ctx);
 (...)

   which is not necessary since it was created in the past transaction,
   with a generation of 1000, and that transaction has already committed
   its super blocks - it's still unpinning extents so it has not yet
   updated fs_info->last_trans_committed from 999 to 1000.

   So this just causes us to spend more time logging and allocating and
   writing more tree blocks for the log tree.

So fix this by comparing an inode's generation with the generation of the
transaction our transaction handle refers to - if the inode's generation
matches the generation of the current transaction than we know it is a
new inode we need to log, otherwise don't log it.

This case is often hit when running dbench for a long enough duration.

This patch belongs to a patch set that is comprised of the following
patches:

  btrfs: fix race causing unnecessary inode logging during link and rename
  btrfs: fix race that results in logging old extents during a fast fsync
  btrfs: fix race that causes unnecessary logging of ancestor inodes
  btrfs: fix race that makes inode logging fallback to transaction commit
  btrfs: fix race leading to unnecessary transaction commit when logging inode
  btrfs: do not block inode logging for so long during transaction commit

Performance results are mentioned in the change log of the last patch.

Signed-off-by: Filipe Manana 
Signed-off-by: David Sterba 
Signed-off-by: Sasha Levin 
---
 fs/btrfs/tree-log.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 12182db88222b..72e0ff38646a7 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5804,7 +5804,6 @@ static int log_new_ancestors(struct btrfs_trans_handle 
*trans,
 
while (true) {
struct btrfs_fs_info *fs_info = root->fs_info;
-   const u64 last_committed = fs_info->last_trans_committed;
struct extent_buffer *leaf = path->nodes[0];
int slot = path->slots[0];
struct btrfs_key search_key;
@@ -5820,7 +5819,7 @@ static int log_new_ancestors(struct btrfs_trans_handle 
*trans,
if (IS_ERR(inode))
return PTR_ERR(inode);
 
-   if (BTRFS_I(inode)->generation > last_committed)
+   if (BTRFS_I(inode)->generation >= trans->transid)
ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
  LOG_INODE_EXISTS,
  0, LLONG_MAX, ctx);
@@ -5862,7 +5861,6 @@ static int log_new_ancestors_fast(struct 
btrfs_trans_handle *trans,
  struct btrfs_log_ctx *ctx)
 {
struct btrfs_root *root = inode->root;
-   struct btrfs_fs_info *fs_info = root->fs_info;
struct dentry *old_parent = NULL;
struct super_block *sb = inode->vfs_inode.i_sb;
int ret = 0;
@@ -5876,7 +5874,7 @@ static int log_new_ancestors_fast(struct 
btrfs_trans_handle *trans,
if (root != inode->root)
break;
 
-   if (inode->generation > fs_info->last_trans_committed) {
+   if (inode->generation >= trans->transid) {
ret = btrfs_log_inode(trans, root, inode,
LOG_INODE_EXISTS, 0, LLONG_MAX, ctx);
if (ret)
-- 
2.27.0

Re: [PATCH] ecryptfs: Fix inodes not being evicted until unmount

2020-12-18 Thread Jeffrey Mitchell

Steps to reproduce issue:
Mount nfs
Inside the nfs mount, create back and front directories for ecryptfs
Use ecryptfs to mount one directory onto the other
Create a few files inside the ecryptfs mount
Set ftrace to monitor ecryptfs_do_unlink() and ecryptfs_evict_inode()
Delete files
Unmount ecryptfs

ftrace output before patch:
# tracer: function_graph
#
# CPU  DURATION  FUNCTION CALLS
# | |   | |   |   |   |
 1)   |  /* Tracing enabled */
 1)   |  /* Writing to files */
 1)   |  /* Unlinking first file */
 0) * 27804.34 us |  ecryptfs_do_unlink();
 1)   |  /* rm'ing second file */
 --
 0)  unlink-1384   =>rm-1385
 --

 0) * 32828.63 us |  ecryptfs_do_unlink();
 1)   |  /* Unmounting eCryptFS */
 --
 1)  trace-a-1365  =>  umount-1387
 --

 1)   2.408 us|  ecryptfs_evict_inode();
 1)   8.914 us|  ecryptfs_evict_inode();
 1)   3.344 us|  ecryptfs_evict_inode();

ftrace output after patch:
# tracer: function_graph
#
# CPU  DURATION  FUNCTION CALLS
# | |   | |   |   |   |
 0)   |  /* Tracing enabled */
 0)   |  /* Writing to files */
 0)   |  /* Unlinking first file */
 1) * 24728.81 us |  ecryptfs_do_unlink();
 1) + 20.923 us   |  ecryptfs_evict_inode();
 0)   |  /* rm'ing second file */
 --
 1)  unlink-1333   =>rm-1334
 --

 1) * 41093.71 us |  ecryptfs_do_unlink();
 1) + 11.384 us   |  ecryptfs_evict_inode();
 0)   |  /* Unmounting eCryptFS */
 --
 0)  trace-a-1314  =>  umount-1336
 --

 0)   2.986 us|  ecryptfs_evict_inode();

On Fri, Dec 18, 2020 at 01:07:30PM -0600, Jeffrey Mitchell wrote:
> On asynchronous base filesystems like NFS, eCryptFS leaves inodes for
> deleted files in the cache until unmounting. Change call in
> ecryptfs_do_unlink() from set_nlink() to drop_nlink() in order to reliably
> evict inodes from the cache even on top of NFS.
> 
> Signed-off-by: Dan Robertson 
> Signed-off-by: Jeffrey Mitchell 
> ---
>  fs/ecryptfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
> index e23752d..f7594b6 100644
> --- a/fs/ecryptfs/inode.c
> +++ b/fs/ecryptfs/inode.c
> @@ -147,7 +147,7 @@ static int ecryptfs_do_unlink(struct inode *dir, struct 
> dentry *dentry,
>   goto out_unlock;
>   }
>   fsstack_copy_attr_times(dir, lower_dir_inode);
> - set_nlink(inode, ecryptfs_inode_to_lower(inode)->i_nlink);
> + drop_nlink(inode);
>   inode->i_ctime = dir->i_ctime;
>  out_unlock:
>   dput(lower_dentry);
> -- 
> 2.7.4
>

[PATCH] ecryptfs: Fix inodes not being evicted until unmount

2020-12-18 Thread Jeffrey Mitchell

On asynchronous base filesystems like NFS, eCryptFS leaves inodes for
deleted files in the cache until unmounting. Change call in
ecryptfs_do_unlink() from set_nlink() to drop_nlink() in order to reliably
evict inodes from the cache even on top of NFS.

Signed-off-by: Dan Robertson 
Signed-off-by: Jeffrey Mitchell 
---
 fs/ecryptfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index e23752d..f7594b6 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -147,7 +147,7 @@ static int ecryptfs_do_unlink(struct inode *dir, struct 
dentry *dentry,
goto out_unlock;
}
fsstack_copy_attr_times(dir, lower_dir_inode);
-   set_nlink(inode, ecryptfs_inode_to_lower(inode)->i_nlink);
+   drop_nlink(inode);
inode->i_ctime = dir->i_ctime;
 out_unlock:
dput(lower_dentry);
-- 
2.7.4

Re: WARNING: filesystem loop0 was created with 512 inodes, the real maximum is 511, mounting anyway

2020-12-07 Thread Dmitry Vyukov

On Mon, Sep 28, 2020 at 11:08 AM Tigran Aivazian
 wrote:
>
> On Mon, 28 Sep 2020 at 09:29, Dmitry Vyukov  wrote:
> > On Mon, Sep 28, 2020 at 10:23 AM Tigran Aivazian
> > > No, this is not an issue. In the latest change to BFS I added the
> > > following comment to the header fs/bfs/bfs.h, which explains it:
> > >
> > > /* In theory BFS supports up to 512 inodes, numbered from 2 (for /) up
> > > to 513 inclusive.
> > >In actual fact, attempting to create the 512th inode (i.e. inode
> > > No. 513 or file No. 511)
> > >will fail with ENOSPC in bfs_add_entry(): the root directory cannot
> > > contain so many entries, counting '..'.
> > >So, mkfs.bfs(8) should really limit its -N option to 511 and not
> > > 512. For now, we just print a warning
> > >if a filesystem is mounted with such "impossible to fill up" number
> > > of inodes */
> >
> > There are rules for use of "WARNING" in output required to support
> > kernel testing:
> > https://github.com/torvalds/linux/blob/master/include/asm-generic/bug.h#L67-L80
> > This seems to be triggerable by exteranal inputs and breaks these rules.
>
> Thank you, I didn't know about these rules. Ok, then, since this
> warning does not "need prompt attention if it should ever occur at
> runtime", the easiest solution is to change "WARNING" to lower case
> "warning" in that printk in fs/bfs/inode.c:
>
> --- fs/bfs/inode.c.0 2020-09-28 10:03:00.658549556 +0100
> +++ fs/bfs/inode.c 2020-09-28 10:03:05.408548250 +0100
> @@ -351,7 +351,7 @@
>
>   info->si_lasti = (le32_to_cpu(bfs_sb->s_start) - BFS_BSIZE) /
> sizeof(struct bfs_inode) + BFS_ROOT_INO - 1;
>   if (info->si_lasti == BFS_MAX_LASTI)
> - printf("WARNING: filesystem %s was created with 512 inodes, the real
> maximum is 511, mounting anyway\n", s->s_id);
> + printf("warning: filesystem %s was created with 512 inodes, the real
> maximum is 511, mounting anyway\n", s->s_id);
>   else if (info->si_lasti > BFS_MAX_LASTI) {
>   printf("Impossible last inode number %lu > %d on %s\n",
> info->si_lasti, BFS_MAX_LASTI, s->s_id);
>   goto out1;
>
> If you want to submit this patch to the appropriate place(s), feel
> free to do this -- I approve it. If the comment in asm/bug.h is
> inaccurate and its mention of "BUG/WARNING" implies the lowercase
> "bug/warning" also, then one can remove the prefix "warning: " from
> the patch altogether and proper case "filesystem" to "Filesystem".
>
> Kind regards,
> Tigran
>
> Acked-By: Tigran Aivazian 
> Approved-By: Tigran Aivazian 


#syz fix: bfs: don't use WARNING: string when it's just info.

Re: WARNING: filesystem loop1 was created with 512 inodes, the real maximum is 511, mounting anyway

2020-12-07 Thread Dmitry Vyukov

#syz fix: bfs: don't use WARNING: string when it's just info.

On Mon, Sep 28, 2020 at 8:10 PM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:c9c9e6a4 Merge tag 'trace-v5.9-rc5-2' of git://git.kernel...
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=138b09d990
> kernel config:  https://syzkaller.appspot.com/x/.config?x=5f4c828c9e3cef97
> dashboard link: https://syzkaller.appspot.com/bug?extid=2435de7315366e15f0ca
> compiler:   gcc (GCC) 10.1.0-syz 20200507
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+2435de7315366e15f...@syzkaller.appspotmail.com
>
> BFS-fs: bfs_fill_super(): WARNING: filesystem loop1 was created with 512 
> inodes, the real maximum is 511, mounting anyway
> BFS-fs: bfs_fill_super(): Inode 0x0002 corrupted on loop1
> BFS-fs: bfs_fill_super(): WARNING: filesystem loop1 was created with 512 
> inodes, the real maximum is 511, mounting anyway
> BFS-fs: bfs_fill_super(): Inode 0x0002 corrupted on loop1
>
>
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
>
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
>
> --
> You received this message because you are subscribed to the Google Groups 
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to syzkaller-bugs+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/syzkaller-bugs/6f1f5d05b06394d0%40google.com.

Re: WARNING: filesystem loop4 was created with 512 inodes, the real maximum is 511, mounting anyway

2020-12-07 Thread Dmitry Vyukov

#syz fix: bfs: don't use WARNING: string when it's just info.

On Sat, Nov 21, 2020 at 8:33 AM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:09162bc3 Linux 5.10-rc4
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=103f4fbe50
> kernel config:  https://syzkaller.appspot.com/x/.config?x=75292221eb79ace2
> dashboard link: https://syzkaller.appspot.com/bug?extid=1a219abc12077a390bc9
> compiler:   gcc (GCC) 10.1.0-syz 20200507
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+1a219abc12077a390...@syzkaller.appspotmail.com
>
> BFS-fs: bfs_fill_super(): WARNING: filesystem loop4 was created with 512 
> inodes, the real maximum is 511, mounting anyway
> BFS-fs: bfs_fill_super(): Last block not available on loop4: 1507328
> BFS-fs: bfs_fill_super(): WARNING: filesystem loop4 was created with 512 
> inodes, the real maximum is 511, mounting anyway
> BFS-fs: bfs_fill_super(): Last block not available on loop4: 1507328
>
>
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
>
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
>
> --
> You received this message because you are subscribed to the Google Groups 
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to syzkaller-bugs+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/syzkaller-bugs/bf566005b498f95f%40google.com.

Re: WARNING: filesystem loop2 was created with 512 inodes, the real maximum is 511, mounting anyway

2020-12-07 Thread Dmitry Vyukov

#syz fix: bfs: don't use WARNING: string when it's just info.

On Sat, Nov 21, 2020 at 8:33 AM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:09162bc3 Linux 5.10-rc4
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=16e9a48650
> kernel config:  https://syzkaller.appspot.com/x/.config?x=e93bbe4ce29223b
> dashboard link: https://syzkaller.appspot.com/bug?extid=ae3ff0bb2a0133596a5b
> compiler:   clang version 11.0.0 
> (https://github.com/llvm/llvm-project.git 
> ca2dcbd030eadbf0aa9b660efe864ff08af6e18b)
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+ae3ff0bb2a0133596...@syzkaller.appspotmail.com
>
> BFS-fs: bfs_fill_super(): WARNING: filesystem loop2 was created with 512 
> inodes, the real maximum is 511, mounting anyway
> BFS-fs: bfs_fill_super(): Last block not available on loop2: 1507328
> BFS-fs: bfs_fill_super(): WARNING: filesystem loop2 was created with 512 
> inodes, the real maximum is 511, mounting anyway
> BFS-fs: bfs_fill_super(): Last block not available on loop2: 1507328
>
>
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
>
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
>
> --
> You received this message because you are subscribed to the Google Groups 
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to syzkaller-bugs+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/syzkaller-bugs/c2f72c05b498f9bd%40google.com.

Re: WARNING: filesystem loop3 was created with 512 inodes, the real maximum is 511, mounting anyway

2020-12-07 Thread Dmitry Vyukov

#syz fix: bfs: don't use WARNING: string when it's just info.

On Thu, Sep 24, 2020 at 11:40 AM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:98477740 Merge branch 'rcu/urgent' of git://git.kernel.org..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=15964ec390
> kernel config:  https://syzkaller.appspot.com/x/.config?x=6f192552d75898a1
> dashboard link: https://syzkaller.appspot.com/bug?extid=293714df4fe354fae488
> compiler:   gcc (GCC) 10.1.0-syz 20200507
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+293714df4fe354fae...@syzkaller.appspotmail.com
>
> BFS-fs: bfs_fill_super(): WARNING: filesystem loop3 was created with 512 
> inodes, the real maximum is 511, mounting anyway
> BFS-fs: bfs_fill_super(): Inode 0x0002 corrupted on loop3
>
>
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
>
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
>
> --
> You received this message because you are subscribed to the Google Groups 
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to syzkaller-bugs+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/syzkaller-bugs/28201305b00bfdba%40google.com.

Re: WARNING: filesystem loop0 was created with 512 inodes, the real maximum is 511, mounting anywa

2020-12-07 Thread Dmitry Vyukov

#syz fix: bfs: don't use WARNING: string when it's just info.

On Mon, Dec 7, 2020 at 1:53 PM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:34816d20 Merge tag 'gfs2-v5.10-rc5-fixes' of git://git.ker..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=157dad0750
> kernel config:  https://syzkaller.appspot.com/x/.config?x=b3a044ccf5b03ac4
> dashboard link: https://syzkaller.appspot.com/bug?extid=02c44c7f92e70a73730a
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=152b05ab50
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=14fc3fad50
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+02c44c7f92e70a737...@syzkaller.appspotmail.com
>
> BFS-fs: bfs_fill_super(): WARNING: filesystem loop0 was created with 512 
> inodes, the real maximum is 511, mounting anywa
>
>
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
>
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this issue, for details see:
> https://goo.gl/tpsmEJ#testing-patches
>
> --
> You received this message because you are subscribed to the Google Groups 
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to syzkaller-bugs+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/syzkaller-bugs/fbb4f505b5df4eea%40google.com.

WARNING: filesystem loop0 was created with 512 inodes, the real maximum is 511, mounting anywa

2020-12-07 Thread syzbot

Hello,

syzbot found the following issue on:

HEAD commit:34816d20 Merge tag 'gfs2-v5.10-rc5-fixes' of git://git.ker..
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=157dad0750
kernel config:  https://syzkaller.appspot.com/x/.config?x=b3a044ccf5b03ac4
dashboard link: https://syzkaller.appspot.com/bug?extid=02c44c7f92e70a73730a
compiler:   gcc (GCC) 10.1.0-syz 20200507
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=152b05ab50
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=14fc3fad50

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+02c44c7f92e70a737...@syzkaller.appspotmail.com

BFS-fs: bfs_fill_super(): WARNING: filesystem loop0 was created with 512 
inodes, the real maximum is 511, mounting anywa


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches

[PATCH 10/10] proc: Use a list of inodes to flush from proc

2020-12-03 Thread Wen Yang

From: "Eric W. Biederman" 

[ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ]

Rework the flushing of proc to use a list of directory inodes that
need to be flushed.

The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.

This removes the dependency on proc_mnt which allows for different
mounts of proc having different mount options even in the same pid
namespace and this allows for the removal of proc_mnt which will
trivially the first mount of proc to honor it's mount options.

This flushing remains an optimization.  The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died.  When unused the shrinker will
eventually be able to remove these dentries.

There is a case in de_thread where proc_flush_pid can be
called early for a given pid.  Which winds up being
safe (if suboptimal) as this is just an optiimization.

Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.

So that the pid can be used during flushing it's reference count is
taken in release_task and dropped in proc_flush_pid.  Further the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing it taking place.  This removes a small race
where a dentry could recreated.

As struct pid is supposed to be small and I need a per pid lock
I reuse the only lock that currently exists in struct pid the
the wait_pidfd.lock.

The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.

v2: Initialize pid->inodes.  I somehow failed to get that
initialization into the initial version of the patch.  A boot
failure was reported by "kernel test robot ", and
failure to initialize that pid->inodes matches all of the reported
symptoms.

Signed-off-by: Eric W. Biederman 
Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces")
Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush 
entries from multiple proc trees")
Cc:  # 4.9.x: b3e5838: clone: add CLONE_PIDFD
Cc:  # 4.9.x: b53b0b9: pidfd: add polling support
Cc:  # 4.9.x: db978da: proc: Pass file mode to 
proc_pid_make_inode
Cc:  # 4.9.x: 68eb94f: proc: Better ownership of files 
for non-dumpable tasks in user namespaces
Cc:  # 4.9.x: e3912ac: proc: use %u for pid printing 
and slightly less stack
Cc:  # 4.9.x: 0afa5ca: proc: Rename in proc_inode 
rename sysctl_inodes sibling_inodes
Cc:  # 4.9.x: 26dbc60: proc: Generalize 
proc_sys_prune_dcache into proc_prune_siblings_dcache
Cc:  # 4.9.x: 7144801: proc: Clear the pieces of 
proc_inode that proc_evict_inode cares about
Cc:  # 4.9.x: f90f3ca: Use d_invalidate in 
proc_prune_siblings_dcache
Cc:  # 4.9.x
(proc: fix up cherry-pick conflicts for 7bc3e6e55acf)
Signed-off-by: Wen Yang 
---
 fs/proc/base.c  | 111 
 fs/proc/inode.c |   2 +-
 fs/proc/internal.h  |   1 +
 include/linux/pid.h |   1 +
 include/linux/proc_fs.h |   4 +-
 kernel/exit.c   |   5 ++-
 kernel/pid.c|   1 +
 7 files changed, 45 insertions(+), 80 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3502a40..11caf35 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1728,11 +1728,25 @@ void task_dump_owner(struct task_struct *task, mode_t 
mode,
*rgid = gid;
 }
 
+void proc_pid_evict_inode(struct proc_inode *ei)
+{
+   struct pid *pid = ei->pid;
+
+   if (S_ISDIR(ei->vfs_inode.i_mode)) {
+   spin_lock(>wait_pidfd.lock);
+   hlist_del_init_rcu(>sibling_inodes);
+   spin_unlock(>wait_pidfd.lock);
+   }
+
+   put_pid(pid);
+}
+
 struct inode *proc_pid_make_inode(struct super_block * sb,
  struct task_struct *task, umode_t mode)
 {
struct inode * inode;
struct proc_inode *ei;
+   struct pid *pid;
 
/* We need a new inode */
 
@@ -1750,10 +1764,18 @@ struct inode *proc_pid_make_inode(struct super_block * 
sb,
/*
 * grab the reference to task.
 */
-   ei->pid = get_task_pid(task, PIDTYPE_PID);
-   if (!ei->pid)
+   pid = get_task_pid(task, PIDTYPE_PID);
+   if (!pid)
goto out_unlock;
 
+   /* Let the pid remember us for quick removal */
+   ei->pid = pid;
+   if (S_ISDIR(mode)) {
+   spin_lock(>wait_pidfd.lock);
+   hlist_add_head_rcu(>sibling_inodes, >inodes);
+   spin_unlock(>wait_pidfd.lock);
+   }

Re: WARNING: filesystem loop5 was created with 512 inodes, the real maximum is 511, mounting anyway

2020-12-03 Thread Dmitry Vyukov

On Thu, Dec 3, 2020 at 1:55 PM Dmitry Vyukov  wrote:
>
> On Thu, Dec 3, 2020 at 5:15 AM Randy Dunlap  wrote:
> >
> > On 12/1/20 1:17 PM, Randy Dunlap wrote:
> > > On 11/30/20 11:47 PM, Dmitry Vyukov wrote:
> > >> On Tue, Dec 1, 2020 at 2:03 AM Randy Dunlap  
> > >> wrote:
> > >>>
> > >>> On 11/30/20 12:43 AM, Dmitry Vyukov wrote:
> > >>>> On Mon, Nov 30, 2020 at 5:29 AM Randy Dunlap  
> > >>>> wrote:
> > >>>>>
> > >>>>> On 11/27/20 4:32 AM, syzbot wrote:
> > >>>>>> Hello,
> > >>>>>>
> > >>>>>> syzbot found the following issue on:
> > >>>>>>
> > >>>>>> HEAD commit:418baf2c Linux 5.10-rc5
> > >>>>>> git tree:   upstream
> > >>>>>> console output: 
> > >>>>>> https://syzkaller.appspot.com/x/log.txt?x=171555b950
> > >>>>>> kernel config:  
> > >>>>>> https://syzkaller.appspot.com/x/.config?x=b81aff78c272da44
> > >>>>>> dashboard link: 
> > >>>>>> https://syzkaller.appspot.com/bug?extid=3fd34060f26e766536ff
> > >>>>>> compiler:   gcc (GCC) 10.1.0-syz 20200507
> > >>>>>>
> > >>>>>> Unfortunately, I don't have any reproducer for this issue yet.
> > >>>>>>
> > >>>>>> IMPORTANT: if you fix the issue, please add the following tag to the 
> > >>>>>> commit:
> > >>>>>> Reported-by: syzbot+3fd34060f26e76653...@syzkaller.appspotmail.com
> > >>>>>>
> > >>>>>> BFS-fs: bfs_fill_super(): loop5 is unclean, continuing
> > >>>>>> BFS-fs: bfs_fill_super(): WARNING: filesystem loop5 was created with 
> > >>>>>> 512 inodes, the real maximum is 511, mounting anyway
> > >>>>>> BFS-fs: bfs_fill_super(): Last block not available on loop5: 120
> > >>>>>> BFS-fs: bfs_fill_super(): loop5 is unclean, continuing
> > >>>>>> BFS-fs: bfs_fill_super(): WARNING: filesystem loop5 was created with 
> > >>>>>> 512 inodes, the real maximum is 511, mounting anyway
> > >>>>>> BFS-fs: bfs_fill_super(): Last block not available on loop5: 120
> > >>>>>>
> > >>>>>>
> > >>>>>> ---
> > >>>>>> This report is generated by a bot. It may contain errors.
> > >>>>>> See https://goo.gl/tpsmEJ for more information about syzbot.
> > >>>>>> syzbot engineers can be reached at syzkal...@googlegroups.com.
> > >>>>>>
> > >>>>>> syzbot will keep track of this issue. See:
> > >>>>>> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> >
> > ...
> >
> > >>>> Hi Randy,
> > >>>>
> > >>>> I see this bug was reported with a reproducer:
> > >>>> https://syzkaller.appspot.com/bug?id=a32ebd5db2f7c957b82cf54b97bdecf367bf0421
> > >>>> I assume it's a dup of this one.
> > >>>
> > >>> Sure, looks the same.
> > >>>
> > >>>> If you need the image itself, you can dump it to a file in the C
> > >>>> reproducer inside of syz_mount_image before mount call.
> > >>>
> > >>> Yes, got that.
> > >>>
> > >>> What outcome or result are you looking for here?
> > >>> Or what do you see as the problem?
> > >>
> > >> Hi Randy,
> > >>
> > >> "WARNING:" in kernel output is supposed to mean a kernel source bug.
> > >> Presence of that kernel bug is what syzbot has reported.
> > >>
> > >> Note: the bug may be a misuse of the "WARNING:" for invalid user
> > >> inputs in output as well :)
> > >
> > >
> > > [adding Al Viro]
> > >
> > > Hi Dmitry,
> > >
> > > I expect that the "WARNING:" message is being interpreted incorrectly 
> > > here,
> > > but that's a minor issue IMO.
> > >
> > >   if (info->si_lasti == BFS_MAX_LASTI)
> > >   printf("WARNING: filesystem %s was created with 512 inodes, 
> > > the real maximum is 511, mounting anyway\n", s->s_id);
> > >
> >
> > ...
> >
> > >
> > >
> > > However, in testing this, I see that the BFS image is not mounted
> > > on /dev/loop# at all.
> > >
> > > 'mount' says:
> > >
> > > # mount -t bfs -o loop bfsfilesyz000.img  /mnt/stand
> > > mount: /mnt/stand: mount(2) system call failed: Not a directory.
> > >
> > > (but it is a directory)
> > >
> > > and I have tracked that down to fs/namespace.c::graft_tree()
> > > returning -ENOTDIR, but I don't know why that is happening.
> > >
> > >
> > > Al, can you provide any insights on this?
> >
> > OK, with Al's help, here is the situation.
> >
> > If I use a regular file instead of a directory, the mount
> > command succeeds.
> >
> > The printk() from fs/bfs/inode.c that uses the WARNING: string
> > is not a WARN() or WARN_ON(). It's just a printk().
> >
> >  says:
> >
> >  * Do not include "BUG"/"WARNING" in format strings manually to make these
> >  * conditions distinguishable from kernel issues.
> >
> > so if I change fs/bfs/inode.c to use "warning:" or "Warning," or "Note:",
> > this little problem should go away.  Is that correct?
>
> Hi,
>
> Yes, any of these prefixes will work (not be considered as a kernel
> issue). syzkaller only matches "WARNING:" verbatim. I don't know about
> all other kernel testing systems, but at least it's distinguishable.
>
> Maybe also worth adding "bfs:" prefix for cases when people stare at
> dmesg afterwards.

Oh, sorry, there are already enough prefixes (BFS-fs: bfs_fill_super():).

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1898 matches

Mail list logo