Re: [PATCH v26 07/12] landlock: Support filesystem access-control

2021-01-16 Thread Mickaël Salaün


On 15/01/2021 19:31, Jann Horn wrote:
> On Fri, Jan 15, 2021 at 10:10 AM Mickaël Salaün  wrote:
>> On 14/01/2021 23:43, Jann Horn wrote:
>>> On Thu, Jan 14, 2021 at 7:54 PM Mickaël Salaün  wrote:
>>>> On 14/01/2021 04:22, Jann Horn wrote:
>>>>> On Wed, Dec 9, 2020 at 8:28 PM Mickaël Salaün  wrote:
>>>>>> Thanks to the Landlock objects and ruleset, it is possible to identify
>>>>>> inodes according to a process's domain.  To enable an unprivileged
>>>>>> process to express a file hierarchy, it first needs to open a directory
>>>>>> (or a file) and pass this file descriptor to the kernel through
>>>>>> landlock_add_rule(2).  When checking if a file access request is
>>>>>> allowed, we walk from the requested dentry to the real root, following
>>>>>> the different mount layers.  The access to each "tagged" inodes are
>>>>>> collected according to their rule layer level, and ANDed to create
>>>>>> access to the requested file hierarchy.  This makes possible to identify
>>>>>> a lot of files without tagging every inodes nor modifying the
>>>>>> filesystem, while still following the view and understanding the user
>>>>>> has from the filesystem.
>>>>>>
>>>>>> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
>>>>>> keep the same struct inodes for the same inodes whereas these inodes are
>>>>>> in use.
>>>>>>
>>>>>> This commit adds a minimal set of supported filesystem access-control
>>>>>> which doesn't enable to restrict all file-related actions.  This is the
>>>>>> result of multiple discussions to minimize the code of Landlock to ease
>>>>>> review.  Thanks to the Landlock design, extending this access-control
>>>>>> without breaking user space will not be a problem.  Moreover, seccomp
>>>>>> filters can be used to restrict the use of syscall families which may
>>>>>> not be currently handled by Landlock.
>>>>> [...]
>>>>>> +static bool check_access_path_continue(
>>>>>> +   const struct landlock_ruleset *const domain,
>>>>>> +   const struct path *const path, const u32 access_request,
>>>>>> +   u64 *const layer_mask)
>>>>>> +{
>>>>> [...]
>>>>>> +   /*
>>>>>> +    * An access is granted if, for each policy layer, at least one rule
>>>>>> +    * encountered on the pathwalk grants the access, regardless of their
>>>>>> +    * position in the layer stack.  We must then check not-yet-seen layers
>>>>>> +    * for each inode, from the last one added to the first one.
>>>>>> +    */
>>>>>> +   for (i = 0; i < rule->num_layers; i++) {
>>>>>> +       const struct landlock_layer *const layer = &rule->layers[i];
>>>>>> +       const u64 layer_level = BIT_ULL(layer->level - 1);
>>>>>> +
>>>>>> +       if (!(layer_level & *layer_mask))
>>>>>> +           continue;
>>>>>> +       if ((layer->access & access_request) != access_request)
>>>>>> +           return false;
>>>>>> +       *layer_mask &= ~layer_level;
>>>>>
>>>>> Hmm... shouldn't the last 5 lines be replaced by the following?
>>>>>
>>>>> if ((layer->access & access_request) == access_request)
>>>>>     *layer_mask &= ~layer_level;
>>>>>
>>>>> And then, since this function would always return true, you could
>>>>> change its return type to "void".
>>>>>
>>>>>
>>>>> As far as I can tell, the current version will still, if a ruleset
>>>>> looks like this:
>>>>>
>>>>> /usr read+write
>>>>> /usr/lib/ read
>>>>>
>>>>> reject write access to /usr/lib, right?
>>>>
>>>> If these two rules are from different layers, then yes it would work as
>>>> intended. However, if these rules are from the same layer the path walk
>>>> will not stop at /usr/lib but go down to /usr, which grants write
>>>> access.
>>>
>>> I don't see why the code would do what you're saying it does. And an
>>> experiment seems to confirm what I said; I checked out landlock-v26,
>>> and the behavior I get is:
>>
>> There is a misunderstanding, I was responding to your proposition to
>> modify check_access_path_continue(), not about the behavior of landlock-v26.
>>
>>>
>>> user@vm:~/landlock$ dd if=/dev/null of=/tmp/aaa
>>> 0+0 records in
>>> 0+0 records out
>>> 0 bytes copied, 0.00106365 s, 0.0 kB/s
>>> user@vm:~/landlock$ LL_FS_RO='/lib' LL_FS_RW='/' ./sandboxer dd
>>> if=/dev/null of=/tmp/aaa
>>> 0+0 records in
>>> 0+0 records out
>>> 0 bytes copied, 0.000491814 s, 0.0 kB/s
>>> user@vm:~/landlock$ LL_FS_RO='/tmp' LL_FS_RW='/' ./sandboxer dd
>>> if=/dev/null of=/tmp/aaa
>>> dd: failed to open '/tmp/aaa': Permission denied
>>> user@vm:~/landlock$
>>>
>>> Granting read access to /tmp prevents writing to it, even though write
>>> access was granted to /.
>>>
>>
>> It indeed works like this with landlock-v26. However, with your above
>> proposition, it would work like this:
>>
>> $ LL_FS_RO='/tmp' LL_FS_RW='/' ./sandboxer dd if=/dev/null of=/tmp/aaa
>> 0+0 records in
>> 0+0 records out
>> 0 bytes copied, 0.000187265 s, 0.0 kB/s
>>

Re: [PATCH v26 07/12] landlock: Support filesystem access-control

2021-01-15 Thread Jann Horn
On Fri, Jan 15, 2021 at 10:10 AM Mickaël Salaün  wrote:
> On 14/01/2021 23:43, Jann Horn wrote:
> > On Thu, Jan 14, 2021 at 7:54 PM Mickaël Salaün  wrote:
> >> On 14/01/2021 04:22, Jann Horn wrote:
> >>> On Wed, Dec 9, 2020 at 8:28 PM Mickaël Salaün  wrote:
> >>>> Thanks to the Landlock objects and ruleset, it is possible to identify
> >>>> inodes according to a process's domain.  To enable an unprivileged
> >>>> process to express a file hierarchy, it first needs to open a directory
> >>>> (or a file) and pass this file descriptor to the kernel through
> >>>> landlock_add_rule(2).  When checking if a file access request is
> >>>> allowed, we walk from the requested dentry to the real root, following
> >>>> the different mount layers.  The access to each "tagged" inodes are
> >>>> collected according to their rule layer level, and ANDed to create
> >>>> access to the requested file hierarchy.  This makes possible to identify
> >>>> a lot of files without tagging every inodes nor modifying the
> >>>> filesystem, while still following the view and understanding the user
> >>>> has from the filesystem.
> >>>>
> >>>> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
> >>>> keep the same struct inodes for the same inodes whereas these inodes are
> >>>> in use.
> >>>>
> >>>> This commit adds a minimal set of supported filesystem access-control
> >>>> which doesn't enable to restrict all file-related actions.  This is the
> >>>> result of multiple discussions to minimize the code of Landlock to ease
> >>>> review.  Thanks to the Landlock design, extending this access-control
> >>>> without breaking user space will not be a problem.  Moreover, seccomp
> >>>> filters can be used to restrict the use of syscall families which may
> >>>> not be currently handled by Landlock.
> >>> [...]
> >>>> +static bool check_access_path_continue(
> >>>> +   const struct landlock_ruleset *const domain,
> >>>> +   const struct path *const path, const u32 access_request,
> >>>> +   u64 *const layer_mask)
> >>>> +{
> >>> [...]
> >>>> +   /*
> >>>> +    * An access is granted if, for each policy layer, at least one rule
> >>>> +    * encountered on the pathwalk grants the access, regardless of their
> >>>> +    * position in the layer stack.  We must then check not-yet-seen layers
> >>>> +    * for each inode, from the last one added to the first one.
> >>>> +    */
> >>>> +   for (i = 0; i < rule->num_layers; i++) {
> >>>> +       const struct landlock_layer *const layer = &rule->layers[i];
> >>>> +       const u64 layer_level = BIT_ULL(layer->level - 1);
> >>>> +
> >>>> +       if (!(layer_level & *layer_mask))
> >>>> +           continue;
> >>>> +       if ((layer->access & access_request) != access_request)
> >>>> +           return false;
> >>>> +       *layer_mask &= ~layer_level;
> >>>
> >>> Hmm... shouldn't the last 5 lines be replaced by the following?
> >>>
> >>> if ((layer->access & access_request) == access_request)
> >>>     *layer_mask &= ~layer_level;
> >>>
> >>> And then, since this function would always return true, you could
> >>> change its return type to "void".
> >>>
> >>>
> >>> As far as I can tell, the current version will still, if a ruleset
> >>> looks like this:
> >>>
> >>> /usr read+write
> >>> /usr/lib/ read
> >>>
> >>> reject write access to /usr/lib, right?
> >>
> >> If these two rules are from different layers, then yes it would work as
> >> intended. However, if these rules are from the same layer the path walk
> >> will not stop at /usr/lib but go down to /usr, which grants write
> >> access.
> >
> > I don't see why the code would do what you're saying it does. And an
> > experiment seems to confirm what I said; I checked out landlock-v26,
> > and the behavior I get is:
>
> There is a misunderstanding, I was responding to your proposition to
> modify check_access_path_continue(), not about the behavior of landlock-v26.
>
> >
> > user@vm:~/landlock$ dd if=/dev/null of=/tmp/aaa
> > 0+0 records in
> > 0+0 records out
> > 0 bytes copied, 0.00106365 s, 0.0 kB/s
> > user@vm:~/landlock$ LL_FS_RO='/lib' LL_FS_RW='/' ./sandboxer dd
> > if=/dev/null of=/tmp/aaa
> > 0+0 records in
> > 0+0 records out
> > 0 bytes copied, 0.000491814 s, 0.0 kB/s
> > user@vm:~/landlock$ LL_FS_RO='/tmp' LL_FS_RW='/' ./sandboxer dd
> > if=/dev/null of=/tmp/aaa
> > dd: failed to open '/tmp/aaa': Permission denied
> > user@vm:~/landlock$
> >
> > Granting read access to /tmp prevents writing to it, even though write
> > access was granted to /.
> >
>
> It indeed works like this with landlock-v26. However, with your above
> proposition, it would work like this:
>
> $ LL_FS_RO='/tmp' LL_FS_RW='/' ./sandboxer dd if=/dev/null of=/tmp/aaa
> 0+0 records in
> 0+0 records out
> 0 bytes copied, 0.000187265 s, 0.0 kB/s
>
> …which is not what users would expect I guess. :)

Ah, 

Re: [PATCH v26 07/12] landlock: Support filesystem access-control

2021-01-15 Thread Mickaël Salaün


On 14/01/2021 23:43, Jann Horn wrote:
> On Thu, Jan 14, 2021 at 7:54 PM Mickaël Salaün  wrote:
>> On 14/01/2021 04:22, Jann Horn wrote:
>>> On Wed, Dec 9, 2020 at 8:28 PM Mickaël Salaün  wrote:
>>>> Thanks to the Landlock objects and ruleset, it is possible to identify
>>>> inodes according to a process's domain.  To enable an unprivileged
>>>> process to express a file hierarchy, it first needs to open a directory
>>>> (or a file) and pass this file descriptor to the kernel through
>>>> landlock_add_rule(2).  When checking if a file access request is
>>>> allowed, we walk from the requested dentry to the real root, following
>>>> the different mount layers.  The access to each "tagged" inodes are
>>>> collected according to their rule layer level, and ANDed to create
>>>> access to the requested file hierarchy.  This makes possible to identify
>>>> a lot of files without tagging every inodes nor modifying the
>>>> filesystem, while still following the view and understanding the user
>>>> has from the filesystem.
>>>>
>>>> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
>>>> keep the same struct inodes for the same inodes whereas these inodes are
>>>> in use.
>>>>
>>>> This commit adds a minimal set of supported filesystem access-control
>>>> which doesn't enable to restrict all file-related actions.  This is the
>>>> result of multiple discussions to minimize the code of Landlock to ease
>>>> review.  Thanks to the Landlock design, extending this access-control
>>>> without breaking user space will not be a problem.  Moreover, seccomp
>>>> filters can be used to restrict the use of syscall families which may
>>>> not be currently handled by Landlock.
>>> [...]
>>>> +static bool check_access_path_continue(
>>>> +   const struct landlock_ruleset *const domain,
>>>> +   const struct path *const path, const u32 access_request,
>>>> +   u64 *const layer_mask)
>>>> +{
>>> [...]
>>>> +   /*
>>>> +    * An access is granted if, for each policy layer, at least one rule
>>>> +    * encountered on the pathwalk grants the access, regardless of their
>>>> +    * position in the layer stack.  We must then check not-yet-seen layers
>>>> +    * for each inode, from the last one added to the first one.
>>>> +    */
>>>> +   for (i = 0; i < rule->num_layers; i++) {
>>>> +       const struct landlock_layer *const layer = &rule->layers[i];
>>>> +       const u64 layer_level = BIT_ULL(layer->level - 1);
>>>> +
>>>> +       if (!(layer_level & *layer_mask))
>>>> +           continue;
>>>> +       if ((layer->access & access_request) != access_request)
>>>> +           return false;
>>>> +       *layer_mask &= ~layer_level;
>>>
>>> Hmm... shouldn't the last 5 lines be replaced by the following?
>>>
>>> if ((layer->access & access_request) == access_request)
>>>     *layer_mask &= ~layer_level;
>>>
>>> And then, since this function would always return true, you could
>>> change its return type to "void".
>>>
>>>
>>> As far as I can tell, the current version will still, if a ruleset
>>> looks like this:
>>>
>>> /usr read+write
>>> /usr/lib/ read
>>>
>>> reject write access to /usr/lib, right?
>>
>> If these two rules are from different layers, then yes it would work as
>> intended. However, if these rules are from the same layer the path walk
>> will not stop at /usr/lib but go down to /usr, which grants write
>> access.
> 
> I don't see why the code would do what you're saying it does. And an
> experiment seems to confirm what I said; I checked out landlock-v26,
> and the behavior I get is:

There is a misunderstanding, I was responding to your proposition to
modify check_access_path_continue(), not about the behavior of landlock-v26.

> 
> user@vm:~/landlock$ dd if=/dev/null of=/tmp/aaa
> 0+0 records in
> 0+0 records out
> 0 bytes copied, 0.00106365 s, 0.0 kB/s
> user@vm:~/landlock$ LL_FS_RO='/lib' LL_FS_RW='/' ./sandboxer dd
> if=/dev/null of=/tmp/aaa
> 0+0 records in
> 0+0 records out
> 0 bytes copied, 0.000491814 s, 0.0 kB/s
> user@vm:~/landlock$ LL_FS_RO='/tmp' LL_FS_RW='/' ./sandboxer dd
> if=/dev/null of=/tmp/aaa
> dd: failed to open '/tmp/aaa': Permission denied
> user@vm:~/landlock$
> 
> Granting read access to /tmp prevents writing to it, even though write
> access was granted to /.
> 

It indeed works like this with landlock-v26. However, with your above
proposition, it would work like this:

$ LL_FS_RO='/tmp' LL_FS_RW='/' ./sandboxer dd if=/dev/null of=/tmp/aaa
0+0 records in
0+0 records out
0 bytes copied, 0.000187265 s, 0.0 kB/s

…which is not what users would expect I guess. :)
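
The two behaviors contrasted above can be illustrated with a small
userspace model. This is only a sketch, not the kernel code: it assumes a
single policy layer, two made-up access bits (ACC_READ and ACC_WRITE), and
a path walk represented as an array of inodes ordered from the requested
file up to the root. The names model_inode, walk_v26() and walk_proposed()
are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up access bits for this model, not the LANDLOCK_ACCESS_FS_* values. */
#define ACC_READ  (1u << 0)
#define ACC_WRITE (1u << 1)

/* One inode met on the walk from the requested file up to the root. */
struct model_inode {
	bool tagged;     /* true if a rule of the (single) layer is tied to it */
	uint32_t access; /* access rights granted by that rule */
};

/*
 * Modeled on the v26 loop: the first tagged inode decides.  If its rule
 * lacks the requested access, the walk stops and the request is denied.
 */
static bool walk_v26(const struct model_inode *walk, size_t len,
		     uint32_t access_request)
{
	assert(walk != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		if (!walk[i].tagged)
			continue;
		return (walk[i].access & access_request) == access_request;
	}
	return false; /* no rule seen: the layer never granted the access */
}

/*
 * Modeled on the proposed change: a rule that lacks the requested access
 * is skipped, so a rule closer to the root can still grant it.
 */
static bool walk_proposed(const struct model_inode *walk, size_t len,
			  uint32_t access_request)
{
	assert(walk != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		if (walk[i].tagged &&
		    (walk[i].access & access_request) == access_request)
			return true;
	}
	return false;
}

/* Walk for /tmp/aaa with LL_FS_RO='/tmp' LL_FS_RW='/': aaa, /tmp, /. */
static const struct model_inode walk_tmp_ro[] = {
	{ false, 0 },
	{ true, ACC_READ },
	{ true, ACC_READ | ACC_WRITE },
};

/* Walk for /tmp/aaa with LL_FS_RO='/lib' LL_FS_RW='/': aaa, /tmp, /. */
static const struct model_inode walk_lib_ro[] = {
	{ false, 0 },
	{ false, 0 },
	{ true, ACC_READ | ACC_WRITE },
};
```

In this model, walk_v26(walk_tmp_ro, 3, ACC_WRITE) is false, matching the
"Permission denied" result, while walk_proposed() returns true for the
same request. Both functions return true for walk_lib_ro.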


Re: [PATCH v26 07/12] landlock: Support filesystem access-control

2021-01-14 Thread Jann Horn
On Thu, Jan 14, 2021 at 7:54 PM Mickaël Salaün  wrote:
> On 14/01/2021 04:22, Jann Horn wrote:
> > On Wed, Dec 9, 2020 at 8:28 PM Mickaël Salaün  wrote:
> >> Thanks to the Landlock objects and ruleset, it is possible to identify
> >> inodes according to a process's domain.  To enable an unprivileged
> >> process to express a file hierarchy, it first needs to open a directory
> >> (or a file) and pass this file descriptor to the kernel through
> >> landlock_add_rule(2).  When checking if a file access request is
> >> allowed, we walk from the requested dentry to the real root, following
> >> the different mount layers.  The access to each "tagged" inodes are
> >> collected according to their rule layer level, and ANDed to create
> >> access to the requested file hierarchy.  This makes possible to identify
> >> a lot of files without tagging every inodes nor modifying the
> >> filesystem, while still following the view and understanding the user
> >> has from the filesystem.
> >>
> >> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
> >> keep the same struct inodes for the same inodes whereas these inodes are
> >> in use.
> >>
> >> This commit adds a minimal set of supported filesystem access-control
> >> which doesn't enable to restrict all file-related actions.  This is the
> >> result of multiple discussions to minimize the code of Landlock to ease
> >> review.  Thanks to the Landlock design, extending this access-control
> >> without breaking user space will not be a problem.  Moreover, seccomp
> >> filters can be used to restrict the use of syscall families which may
> >> not be currently handled by Landlock.
> > [...]
> >> +static bool check_access_path_continue(
> >> +   const struct landlock_ruleset *const domain,
> >> +   const struct path *const path, const u32 access_request,
> >> +   u64 *const layer_mask)
> >> +{
> > [...]
> >> +   /*
> >> +    * An access is granted if, for each policy layer, at least one rule
> >> +    * encountered on the pathwalk grants the access, regardless of their
> >> +    * position in the layer stack.  We must then check not-yet-seen layers
> >> +    * for each inode, from the last one added to the first one.
> >> +    */
> >> +   for (i = 0; i < rule->num_layers; i++) {
> >> +       const struct landlock_layer *const layer = &rule->layers[i];
> >> +       const u64 layer_level = BIT_ULL(layer->level - 1);
> >> +
> >> +       if (!(layer_level & *layer_mask))
> >> +           continue;
> >> +       if ((layer->access & access_request) != access_request)
> >> +           return false;
> >> +       *layer_mask &= ~layer_level;
> >
> > Hmm... shouldn't the last 5 lines be replaced by the following?
> >
> > if ((layer->access & access_request) == access_request)
> >     *layer_mask &= ~layer_level;
> >
> > And then, since this function would always return true, you could
> > change its return type to "void".
> >
> >
> > As far as I can tell, the current version will still, if a ruleset
> > looks like this:
> >
> > /usr read+write
> > /usr/lib/ read
> >
> > reject write access to /usr/lib, right?
>
> If these two rules are from different layers, then yes it would work as
> intended. However, if these rules are from the same layer the path walk
> will not stop at /usr/lib but go down to /usr, which grants write
> access.

I don't see why the code would do what you're saying it does. And an
experiment seems to confirm what I said; I checked out landlock-v26,
and the behavior I get is:

user@vm:~/landlock$ dd if=/dev/null of=/tmp/aaa
0+0 records in
0+0 records out
0 bytes copied, 0.00106365 s, 0.0 kB/s
user@vm:~/landlock$ LL_FS_RO='/lib' LL_FS_RW='/' ./sandboxer dd if=/dev/null of=/tmp/aaa
0+0 records in
0+0 records out
0 bytes copied, 0.000491814 s, 0.0 kB/s
user@vm:~/landlock$ LL_FS_RO='/tmp' LL_FS_RW='/' ./sandboxer dd if=/dev/null of=/tmp/aaa
dd: failed to open '/tmp/aaa': Permission denied
user@vm:~/landlock$

Granting read access to /tmp prevents writing to it, even though write
access was granted to /.


Re: [PATCH v26 07/12] landlock: Support filesystem access-control

2021-01-14 Thread Mickaël Salaün


On 14/01/2021 04:22, Jann Horn wrote:
> On Wed, Dec 9, 2020 at 8:28 PM Mickaël Salaün  wrote:
>> Thanks to the Landlock objects and ruleset, it is possible to identify
>> inodes according to a process's domain.  To enable an unprivileged
>> process to express a file hierarchy, it first needs to open a directory
>> (or a file) and pass this file descriptor to the kernel through
>> landlock_add_rule(2).  When checking if a file access request is
>> allowed, we walk from the requested dentry to the real root, following
>> the different mount layers.  The access to each "tagged" inodes are
>> collected according to their rule layer level, and ANDed to create
>> access to the requested file hierarchy.  This makes possible to identify
>> a lot of files without tagging every inodes nor modifying the
>> filesystem, while still following the view and understanding the user
>> has from the filesystem.
>>
>> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
>> keep the same struct inodes for the same inodes whereas these inodes are
>> in use.
>>
>> This commit adds a minimal set of supported filesystem access-control
>> which doesn't enable to restrict all file-related actions.  This is the
>> result of multiple discussions to minimize the code of Landlock to ease
>> review.  Thanks to the Landlock design, extending this access-control
>> without breaking user space will not be a problem.  Moreover, seccomp
>> filters can be used to restrict the use of syscall families which may
>> not be currently handled by Landlock.
> [...]
>> +static bool check_access_path_continue(
>> +   const struct landlock_ruleset *const domain,
>> +   const struct path *const path, const u32 access_request,
>> +   u64 *const layer_mask)
>> +{
> [...]
>> +   /*
>> +    * An access is granted if, for each policy layer, at least one rule
>> +    * encountered on the pathwalk grants the access, regardless of their
>> +    * position in the layer stack.  We must then check not-yet-seen layers
>> +    * for each inode, from the last one added to the first one.
>> +    */
>> +   for (i = 0; i < rule->num_layers; i++) {
>> +       const struct landlock_layer *const layer = &rule->layers[i];
>> +       const u64 layer_level = BIT_ULL(layer->level - 1);
>> +
>> +       if (!(layer_level & *layer_mask))
>> +           continue;
>> +       if ((layer->access & access_request) != access_request)
>> +           return false;
>> +       *layer_mask &= ~layer_level;
>
> Hmm... shouldn't the last 5 lines be replaced by the following?
>
> if ((layer->access & access_request) == access_request)
>     *layer_mask &= ~layer_level;
> 
> And then, since this function would always return true, you could
> change its return type to "void".
> 
> 
> As far as I can tell, the current version will still, if a ruleset
> looks like this:
> 
> /usr read+write
> /usr/lib/ read
> 
> reject write access to /usr/lib, right?

If these two rules are from different layers, then yes it would work as
intended. However, if these rules are from the same layer the path walk
will not stop at /usr/lib but go down to /usr, which grants write
access. This is the reason I wrote it like this, and the
layout1.inherit_subset test checks that. I'm updating the documentation
to better explain how an access is checked with one or multiple layers.

Doing it this way also makes it possible to stop the path walk earlier,
which is the original purpose of this function.


> 
> 
>> +   }
>> +   return true;
>> +}
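
The "different layers" case mentioned above can be sketched with a
userspace model of the quoted loop. This is a simplified illustration, not
the kernel implementation: the types model_layer and model_rule, the
helper check_walk(), the access bits, and MAX_LAYERS are hypothetical. It
also assumes that the walk starts with one bit set in layer_mask per
domain layer, and that access is granted only if every bit has been
cleared when the walk ends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up access bits for this model, not the LANDLOCK_ACCESS_FS_* values. */
#define ACC_READ  (1u << 0)
#define ACC_WRITE (1u << 1)
#define MAX_LAYERS 4

struct model_layer {
	uint16_t level;  /* 1-based index of the policy layer */
	uint32_t access; /* access rights this layer grants on the inode */
};

struct model_rule {
	size_t num_layers; /* 0 means no rule is tied to this inode */
	struct model_layer layers[MAX_LAYERS];
};

/* Userspace copy of the quoted loop, applied to one inode of the walk. */
static bool check_continue(const struct model_rule *rule,
			   uint32_t access_request, uint64_t *layer_mask)
{
	for (size_t i = 0; i < rule->num_layers; i++) {
		const struct model_layer *layer = &rule->layers[i];
		const uint64_t layer_level = UINT64_C(1) << (layer->level - 1);

		if (!(layer_level & *layer_mask))
			continue; /* this layer already granted the access */
		if ((layer->access & access_request) != access_request)
			return false; /* this layer denies: stop the walk */
		*layer_mask &= ~layer_level;
	}
	return true;
}

/* Walks from the requested file to the root (assumed driver logic). */
static bool check_walk(const struct model_rule *walk, size_t len,
		       size_t num_domain_layers, uint32_t access_request)
{
	uint64_t layer_mask;

	assert(num_domain_layers >= 1 && num_domain_layers < 64);
	layer_mask = (UINT64_C(1) << num_domain_layers) - 1;
	for (size_t i = 0; i < len && layer_mask != 0; i++) {
		if (!check_continue(&walk[i], access_request, &layer_mask))
			return false;
	}
	return layer_mask == 0;
}

/* Walk for /usr/lib/x: layer 1 has "/usr read+write", layer 2 "/usr/lib read". */
static const struct model_rule usr_lib_walk[] = {
	{ .num_layers = 0 },                                    /* /usr/lib/x */
	{ .num_layers = 1, .layers = { { 2, ACC_READ } } },     /* /usr/lib */
	{ .num_layers = 1, .layers = { { 1, ACC_READ | ACC_WRITE } } }, /* /usr */
	{ .num_layers = 0 },                                    /* / */
};
```

Under these assumptions, check_walk(usr_lib_walk, 4, 2, ACC_READ) returns
true because each layer grants read somewhere on the walk, while the same
call with ACC_WRITE returns false because layer 2 never grants write.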


Re: [PATCH v26 07/12] landlock: Support filesystem access-control

2021-01-13 Thread Jann Horn
On Wed, Dec 9, 2020 at 8:28 PM Mickaël Salaün  wrote:
> Thanks to the Landlock objects and ruleset, it is possible to identify
> inodes according to a process's domain.  To enable an unprivileged
> process to express a file hierarchy, it first needs to open a directory
> (or a file) and pass this file descriptor to the kernel through
> landlock_add_rule(2).  When checking if a file access request is
> allowed, we walk from the requested dentry to the real root, following
> the different mount layers.  The access to each "tagged" inodes are
> collected according to their rule layer level, and ANDed to create
> access to the requested file hierarchy.  This makes possible to identify
> a lot of files without tagging every inodes nor modifying the
> filesystem, while still following the view and understanding the user
> has from the filesystem.
>
> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
> keep the same struct inodes for the same inodes whereas these inodes are
> in use.
>
> This commit adds a minimal set of supported filesystem access-control
> which doesn't enable to restrict all file-related actions.  This is the
> result of multiple discussions to minimize the code of Landlock to ease
> review.  Thanks to the Landlock design, extending this access-control
> without breaking user space will not be a problem.  Moreover, seccomp
> filters can be used to restrict the use of syscall families which may
> not be currently handled by Landlock.
[...]
> +static bool check_access_path_continue(
> +   const struct landlock_ruleset *const domain,
> +   const struct path *const path, const u32 access_request,
> +   u64 *const layer_mask)
> +{
[...]
> +   /*
> +    * An access is granted if, for each policy layer, at least one rule
> +    * encountered on the pathwalk grants the access, regardless of their
> +    * position in the layer stack.  We must then check not-yet-seen layers
> +    * for each inode, from the last one added to the first one.
> +    */
> +   for (i = 0; i < rule->num_layers; i++) {
> +       const struct landlock_layer *const layer = &rule->layers[i];
> +       const u64 layer_level = BIT_ULL(layer->level - 1);
> +
> +       if (!(layer_level & *layer_mask))
> +           continue;
> +       if ((layer->access & access_request) != access_request)
> +           return false;
> +       *layer_mask &= ~layer_level;

Hmm... shouldn't the last 5 lines be replaced by the following?

if ((layer->access & access_request) == access_request)
    *layer_mask &= ~layer_level;

And then, since this function would always return true, you could
change its return type to "void".


As far as I can tell, the current version will still, if a ruleset
looks like this:

/usr read+write
/usr/lib/ read

reject write access to /usr/lib, right?


> +   }
> +   return true;
> +}


[PATCH v26 07/12] landlock: Support filesystem access-control

2020-12-09 Thread Mickaël Salaün
From: Mickaël Salaün 

Thanks to the Landlock objects and ruleset, it is possible to identify
inodes according to a process's domain.  To enable an unprivileged
process to express a file hierarchy, it first needs to open a directory
(or a file) and pass this file descriptor to the kernel through
landlock_add_rule(2).  When checking if a file access request is
allowed, we walk from the requested dentry to the real root, following
the different mount layers.  The access to each "tagged" inodes are
collected according to their rule layer level, and ANDed to create
access to the requested file hierarchy.  This makes possible to identify
a lot of files without tagging every inodes nor modifying the
filesystem, while still following the view and understanding the user
has from the filesystem.

Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
keep the same struct inodes for the same inodes whereas these inodes are
in use.

This commit adds a minimal set of supported filesystem access-control
which doesn't enable to restrict all file-related actions.  This is the
result of multiple discussions to minimize the code of Landlock to ease
review.  Thanks to the Landlock design, extending this access-control
without breaking user space will not be a problem.  Moreover, seccomp
filters can be used to restrict the use of syscall families which may
not be currently handled by Landlock.

Cc: Al Viro 
Cc: Anton Ivanov 
Cc: James Morris 
Cc: Jann Horn 
Cc: Jeff Dike 
Cc: Kees Cook 
Cc: Richard Weinberger 
Cc: Serge E. Hallyn 
Signed-off-by: Mickaël Salaün 
---

Changes since v25:
* Move build_check_layer() to ruleset.c, and add built-time checks for
  the fs_access_mask and access variables according to
  _LANDLOCK_ACCESS_FS_MASK.
* Move limits to a dedicated file and rename them:
  _LANDLOCK_ACCESS_FS_LAST and _LANDLOCK_ACCESS_FS_MASK.
* Set build_check_layer() as non-inline to trigger a warning if it is
  not called.
* Use BITS_PER_TYPE() macro.
* Rename function to landlock_add_fs_hooks().
* Cosmetic variable renames.

Changes since v24:
* Use the new struct landlock_rule and landlock_layer to not mix
  accesses from different layers.  Revert "Enforce deterministic
  interleaved path rules" from v24, and fix the layer check.  This
  enables to follow a sane semantic: an access is granted if, for each
  policy layer, at least one rule encountered on the pathwalk grants the
  access, regardless of their position in the layer stack (suggested by
  Jann Horn).  See layout1.interleaved_masked_accesses tests from
  tools/testing/selftests/landlock/fs_test.c for corner cases.
* Add build-time checks for layers.
* Use the new landlock_insert_rule() API.

Changes since v23:
* Enforce deterministic interleaved path rules.  To have consistent
  layered rules, granting access to a path implies that all accesses
  tied to inodes, from the requested file to the real root, must be
  checked.  Otherwise, stacked rules may result to overzealous
  restrictions.  By excluding the ability to add exceptions in the same
  layer (e.g. /a allowed, /a/b denied, and /a/b/c allowed), we get
  deterministic interleaved path rules.  This removes an optimization
  which could be replaced by a proper cache mechanism.  This also
  further simplifies and explain check_access_path_continue().
* Fix memory allocation error handling in landlock_create_object()
  calls.  This prevent to inadvertently hold an inode.
* In get_inode_object(), improve comments, make code more readable and
  move kfree() call out of the lock window.
* Use the simplified landlock_insert_rule() API.

Changes since v22:
* Simplify check_access_path_continue() (suggested by Jann Horn).
* Remove prefetch() call for now (suggested by Jann Horn).
* Fix spelling and remove superfluous comment (spotted by Jann Horn).
* Cosmetic variable renaming.

Changes since v21:
* Rename ARCH_EPHEMERAL_STATES to ARCH_EPHEMERAL_INODES (suggested by
  James Morris).
* Remove the LANDLOCK_ACCESS_FS_CHROOT right because chroot(2) (which
  requires CAP_SYS_CHROOT) doesn't enable to bypass Landlock (as tests
  demonstrate it), and because it is often used by sandboxes, it would
  be counterproductive to forbid it.  This also reduces the code size.
* Clean up documentation.

Changes since v19:
* Fix spelling (spotted by Randy Dunlap).

Changes since v18:
* Remove useless include.
* Fix spelling.

Changes since v17:
* Replace landlock_release_inodes() with security_sb_delete() (requested
  by James Morris).
* Replace struct super_block->s_landlock_inode_refs with the LSM
  infrastructure management of the superblock (requested by James
  Morris).
* Fix mknod restriction with a zero mode (spotted by Vincent Dagonneau).
* Minimize executed code in path_mknod and file_open hooks when the
  current tasks is not sandboxed.
* Remove useless checks on the file pointer and inode in
  hook_file_open() .
* Constify domain pointers.
* Rename inode_landlock() to landlock_inode().
* Import