Re: [RFC PATCH 02/25] kvx: Add ELF-related definitions

2023-01-04 Thread Eric W. Biederman
Yann Sionneau  writes:

> Add ELF-related definitions for kvx, including: EM_KVX,
> AUDIT_ARCH_KVX and NT_KVX_TCA.

Has someone written an SYSVABI architecture specification for
your architecture?

I feel uncomfortable with the linux-kernel headers being the
authoritative place for the ELF abi definitions.

Especially since the linux kernel does not deal with relocations,
and the kernel headers could diverge from the real world and no one
would notice..

I know at least at one point the linux standards base was taking
up the work on collecting up some of these definitions.  I would
be happy if there was anything outside of the linux kernel that
people could refer too.

Eric

> CC: Paul Moore 
> CC: Eric Paris 
> CC: Eric Biederman 
> CC: Kees Cook 
> CC: linux-audit@redhat.com
> CC: linux-ker...@vger.kernel.org
> CC: linux...@kvack.org
> Co-developed-by: Clement Leger 
> Signed-off-by: Clement Leger 
> Signed-off-by: Yann Sionneau 
> ---
>  include/uapi/linux/audit.h  | 1 +
>  include/uapi/linux/elf-em.h | 1 +
>  include/uapi/linux/elf.h| 1 +
>  3 files changed, 3 insertions(+)
>
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index d676ed2b246e..4db7aa3f84c7 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -402,6 +402,7 @@ enum {
>  #define AUDIT_ARCH_HEXAGON   (EM_HEXAGON)
>  #define AUDIT_ARCH_I386  (EM_386|__AUDIT_ARCH_LE)
>  #define AUDIT_ARCH_IA64  
> (EM_IA_64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
> +#define AUDIT_ARCH_KVX   
> (EM_KVX|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
>  #define AUDIT_ARCH_M32R  (EM_M32R)
>  #define AUDIT_ARCH_M68K  (EM_68K)
>  #define AUDIT_ARCH_MICROBLAZE(EM_MICROBLAZE)
> diff --git a/include/uapi/linux/elf-em.h b/include/uapi/linux/elf-em.h
> index ef38c2bc5ab7..9cc348be7f86 100644
> --- a/include/uapi/linux/elf-em.h
> +++ b/include/uapi/linux/elf-em.h
> @@ -51,6 +51,7 @@
>  #define EM_RISCV 243 /* RISC-V */
>  #define EM_BPF   247 /* Linux BPF - in-kernel virtual 
> machine */
>  #define EM_CSKY  252 /* C-SKY */
> +#define EM_KVX   256 /* Kalray VLIW Architecture */
>  #define EM_LOONGARCH 258 /* LoongArch */
>  #define EM_FRV   0x5441  /* Fujitsu FR-V */
>  
> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> index c7b056af9ef0..49094f3be06c 100644
> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -444,6 +444,7 @@ typedef struct elf64_shdr {
>  #define NT_LOONGARCH_LSX 0xa02   /* LoongArch Loongson SIMD Extension 
> registers */
>  #define NT_LOONGARCH_LASX0xa03   /* LoongArch Loongson Advanced SIMD 
> Extension registers */
>  #define NT_LOONGARCH_LBT 0xa04   /* LoongArch Loongson Binary 
> Translation registers */
> +#define NT_KVX_TCA   0x900   /* kvx TCA registers */
>  
>  /* Note types with note name "GNU" */
>  #define NT_GNU_PROPERTY_TYPE_0   5

--
Linux-audit mailing list
Linux-audit@redhat.com
https://listman.redhat.com/mailman/listinfo/linux-audit



Re: [PATCH 00/34] fs: idmapped mounts

2020-10-29 Thread Eric W. Biederman
Lennart Poettering  writes:

> On Do, 29.10.20 10:47, Eric W. Biederman (ebied...@xmission.com) wrote:
>
>> Is that the use case you are looking at removing the need for
>> systemd-homed to avoid chowning after lugging encrypted home directories
>> from one system to another?  Why would it be desirable to avoid the
>> chown?
>
> Yes, I am very interested in seeing Christian's work succeed, for the
> usecase in systemd-homed. In systemd-homed each user gets their own
> private file system, and these fs shall be owned by the user's local
> UID, regardless in which system it is used. The UID should be an
> artifact of the local, individual system in this model, and thus
> the UID on of the same user/home on system A might be picked as 1010
> and on another as 1543, and on a third as 1323, and it shouldn't
> matter. This way, home directories become migratable without having to
> universially sync UID assignments: it doesn't matter anymore what the
> local UID is.
>
> Right now we do a recursive chown() at login time to ensure the home
> dir is properly owned. This has two disadvantages:
>
> 1. It's slow. In particular on large home dirs, it takes a while to go
>through the whole user's homedir tree and chown/adjust ACLs for
>everything.
>
> 2. Because it is so slow we take a shortcut right now: if the
>top-level home dir inode itself is owned by the correct user, we
>skip the recursive chowning. This means in the typical case where a
>user uses the same system most of the time, and thus the UID is
>stable we can avoid the slowness. But this comes at a drawback: if
>the user for some reason ends up with files in their homedir owned
>by an unrelated user, then we'll never notice or readjust.


The classic solution to this problem for removable media are
uid=XXX and gid=XXX mount options.

I suspect a similar solution can apply here.

I don't think you need a solution that requires different kuids
to be able to write to the same filesystem uid.

>> If the goal is to solve fragmented administration of uid assignment I
>> suggest that it might be better to solve the administration problem so
>> that all of the uids of interest get assigned the same way on all of the
>> systems of interest.
>
> Well, the goal is to make things simple and be able to use the home
> dir everywhere without any prior preparation, without central UID
> assignment authority.
>
> The goal is to have a scheme that requires no administration, by
> making the UID management problem go away. Hence, if you suggest
> solving this by having a central administrative authority: this is
> exactly what the model wants to get away from.

For a files that can be accessed by more than a single user this is
fundamentally necessary.  Otherwise group permissions and acls can not
work.  They wind up as meaningless garbage, because without some kind of
synchronization those other users and groups simply can not be
represented.

> Or to say this differently: just because I personally use three
> different computers, I certainly don't want to set up LDAP or sync
> UIDs manually.

If they are single users systems why should you need to?

But if permissions on files are going to be at all meaningful it is
a fundamentally a requirement that there be no confusion about which
party the other parties are talking about.  To the best of my knowledge
syncing uids/usernames between machines is as simple as it can get.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit



Re: [PATCH 00/34] fs: idmapped mounts

2020-10-29 Thread Eric W. Biederman
Aleksa Sarai  writes:

> On 2020-10-29, Eric W. Biederman  wrote:
>> Christian Brauner  writes:
>> 
>> > Hey everyone,
>> >
>> > I vanished for a little while to focus on this work here so sorry for
>> > not being available by mail for a while.
>> >
>> > Since quite a long time we have issues with sharing mounts between
>> > multiple unprivileged containers with different id mappings, sharing a
>> > rootfs between multiple containers with different id mappings, and also
>> > sharing regular directories and filesystems between users with different
>> > uids and gids. The latter use-cases have become even more important with
>> > the availability and adoption of systemd-homed (cf. [1]) to implement
>> > portable home directories.
>> 
>> Can you walk us through the motivating use case?
>> 
>> As of this year's LPC I had the distinct impression that the primary use
>> case for such a feature was due to the RLIMIT_NPROC problem where two
>> containers with the same users still wanted different uid mappings to
>> the disk because the users were conflicting with each other because of
>> the per user rlimits.
>> 
>> Fixing rlimits is straight forward to implement, and easier to manage
>> for implementations and administrators.
>
> This is separate to the question of "isolated user namespaces" and
> managing different mappings between containers. This patchset is solving
> the same problem that shiftfs solved -- sharing a single directory tree
> between containers that have different ID mappings. rlimits (nor any of
> the other proposals we discussed at LPC) will help with this problem.

First and foremost: A uid shift on write to a filesystem is a security
bug waiting to happen.  This is especially in the context of facilities
like iouring, that play very agressive games with how process context
makes it to  system calls.

The only reason containers were not immediately exploitable when iouring
was introduced is because the mechanisms are built so that even if
something escapes containment the security properties still apply.
Changes to the uid when writing to the filesystem does not have that
property.  The tiniest slip in containment will be a security issue.

This is not even the least bit theoretical.  I have seem reports of how
shitfs+overlayfs created a situation where anyone could read
/etc/shadow.

If you are going to write using the same uid to disk from different
containers the question becomes why can't those containers configure
those users to use the same kuid?

What fixing rlimits does is it fixes one of the reasons that different
containers could not share the same kuid for users that want to write to
disk with the same uid.


I humbly suggest that it will be more secure, and easier to maintain for
both developers and users if we fix the reasons people want different
containers to have the same user running with different kuids.

If not what are the reasons we fundamentally need the same on-disk user
using multiple kuids in the kernel?

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit



Re: [PATCH 00/34] fs: idmapped mounts

2020-10-29 Thread Eric W. Biederman
Christian Brauner  writes:

> Hey everyone,
>
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
>
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.

Can you walk us through the motivating use case?

As of this year's LPC I had the distinct impression that the primary use
case for such a feature was due to the RLIMIT_NPROC problem where two
containers with the same users still wanted different uid mappings to
the disk because the users were conflicting with each other because of
the per user rlimits.

Fixing rlimits is straight forward to implement, and easier to manage
for implementations and administrators.



Reading up on systemd-homed it appears to be a way to have encrypted
home directories.  Those home directories can either be encrypted at the
fs or at the block level.  Those home directories appear to have the
goal of being luggable between systems.  If the systems in question
don't have common administration of uids and gids after lugging your
encrypted home directory to another system chowning the files is
required.

Is that the use case you are looking at removing the need for
systemd-homed to avoid chowning after lugging encrypted home directories
from one system to another?  Why would it be desirable to avoid the
chown?


If the goal is to solve fragmented administration of uid assignment I
suggest that it might be better to solve the administration problem so
that all of the uids of interest get assigned the same way on all of the
systems of interest.  To the extent it is possible to solve the uid
assignment problem that would seem to have fewer long term
administrative problems.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit



Re: [PATCH 00/34] fs: idmapped mounts

2020-10-29 Thread Eric W. Biederman
Tycho Andersen  writes:

> Hi Eric,
>
> On Thu, Oct 29, 2020 at 10:47:49AM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > Hey everyone,
>> >
>> > I vanished for a little while to focus on this work here so sorry for
>> > not being available by mail for a while.
>> >
>> > Since quite a long time we have issues with sharing mounts between
>> > multiple unprivileged containers with different id mappings, sharing a
>> > rootfs between multiple containers with different id mappings, and also
>> > sharing regular directories and filesystems between users with different
>> > uids and gids. The latter use-cases have become even more important with
>> > the availability and adoption of systemd-homed (cf. [1]) to implement
>> > portable home directories.
>> 
>> Can you walk us through the motivating use case?
>> 
>> As of this year's LPC I had the distinct impression that the primary use
>> case for such a feature was due to the RLIMIT_NPROC problem where two
>> containers with the same users still wanted different uid mappings to
>> the disk because the users were conflicting with each other because of
>> the per user rlimits.
>> 
>> Fixing rlimits is straight forward to implement, and easier to manage
>> for implementations and administrators.
>
> Our use case is to have the same directory exposed to several
> different containers which each have disjoint ID mappings.

Why do the you have disjoint ID mappings for the users that are writing
to disk with the same ID?

>> Reading up on systemd-homed it appears to be a way to have encrypted
>> home directories.  Those home directories can either be encrypted at the
>> fs or at the block level.  Those home directories appear to have the
>> goal of being luggable between systems.  If the systems in question
>> don't have common administration of uids and gids after lugging your
>> encrypted home directory to another system chowning the files is
>> required.
>> 
>> Is that the use case you are looking at removing the need for
>> systemd-homed to avoid chowning after lugging encrypted home directories
>> from one system to another?  Why would it be desirable to avoid the
>> chown?
>
> Not just systemd-homed, but LXD has to do this,

I asked why the same disk users are assigned different kuids and the
only reason I have heard that LXD does this is the RLIMIT_NPROC problem.

Perhaps there is another reason.

In part this is why I am eager to hear peoples use case, and why I was
trying very hard to make certain we get the requirements.

I want the real requirements though and some thought, not just we did
this and it hurts.  Changning the uids on write is a very hard problem,
and not just in implementating it but also in maintaining and
understanding what is going on.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit



Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

2020-04-18 Thread Eric W. Biederman
Paul Moore  writes:

> On Thu, Apr 16, 2020 at 4:36 PM Eric W. Biederman  
> wrote:
>> Paul Moore  writes:
>> > On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs  wrote:
>> >> On 2020-03-30 13:34, Paul Moore wrote:
>> >> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs  
>> >> > wrote:
>> >> > > On 2020-03-30 10:26, Paul Moore wrote:
>> >> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs 
>> >> > > >  wrote:
>> >> > > > > On 2020-03-28 23:11, Paul Moore wrote:
>> >> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs 
>> >> > > > > >  wrote:
>> >> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
>> >> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs 
>> >> > > > > > > >  wrote:
>> >> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
>> >
>> > ...
>> >
>> >> > > Well, every time a record gets generated, *any* record gets generated,
>> >> > > we'll need to check for which audit daemons this record is in scope 
>> >> > > and
>> >> > > generate a different one for each depending on the content and whether
>> >> > > or not the content is influenced by the scope.
>> >> >
>> >> > That's the problem right there - we don't want to have to generate a
>> >> > unique record for *each* auditd on *every* record.  That is a recipe
>> >> > for disaster.
>> >> >
>> >> > Solving this for all of the known audit records is not something we
>> >> > need to worry about in depth at the moment (although giving it some
>> >> > casual thought is not a bad thing), but solving this for the audit
>> >> > container ID information *is* something we need to worry about right
>> >> > now.
>> >>
>> >> If you think that a different nested contid value string per daemon is
>> >> not acceptable, then we are back to issuing a record that has only *one*
>> >> contid listed without any nesting information.  This brings us back to
>> >> the original problem of keeping *all* audit log history since the boot
>> >> of the machine to be able to track the nesting of any particular contid.
>> >
>> > I'm not ruling anything out, except for the "let's just completely
>> > regenerate every record for each auditd instance".
>>
>> Paul I am a bit confused about what you are referring to when you say
>> regenerate every record.
>>
>> Are you saying that you don't want to repeat the sequence:
>> audit_log_start(...);
>> audit_log_format(...);
>> audit_log_end(...);
>> for every nested audit daemon?
>
> If it can be avoided yes.  Audit performance is already not-awesome,
> this would make it even worse.

As far as I can see not repeating sequences like that is fundamental
for making this work at all.  Just because only the audit subsystem
should know about one or multiple audit daemons.  Nothing else should
care.

>> Or are you saying that you would like to literraly want to send the same
>> skb to each of the nested audit daemons?
>
> Ideally we would reuse the generated audit messages as much as
> possible.  Less work is better.  That's really my main concern here,
> let's make sure we aren't going to totally tank performance when we
> have a bunch of nested audit daemons.

So I think there are two parts of this answer.  Assuming we are talking
about nesting audit daemons in containers we will have different
rulesets and I expect most of the events for a nested audit daemon won't
be of interest to the outer audit daemon.

Beyond that it should be very straight forward to keep a pointer and
leave the buffer as a scatter gather list until audit_log_end
and translate pids, and rewrite ACIDs attributes in audit_log_end
when we build the final packet.  Either through collaboration with
audit_log_format or a special audit_log command that carefully sets
up the handful of things that need that information.

Hmm.  I am seeing that we send skbs to kauditd and then kauditd
sends those skbs to userspace.  I presume that is primary so that
sending messages to userspace does not block the process being audited.

Plus a little bit so that the retry logic will work.

I think the naive implementation would be to simply have 1 kauditd
per auditd (strictly and audit cont

Re: [PATCH ghak90 V8 07/16] audit: add contid support for signalling the audit daemon

2020-04-16 Thread Eric W. Biederman
Paul Moore  writes:

> On Mon, Mar 30, 2020 at 1:49 PM Richard Guy Briggs  wrote:
>> On 2020-03-30 13:34, Paul Moore wrote:
>> > On Mon, Mar 30, 2020 at 12:22 PM Richard Guy Briggs  
>> > wrote:
>> > > On 2020-03-30 10:26, Paul Moore wrote:
>> > > > On Mon, Mar 30, 2020 at 9:47 AM Richard Guy Briggs  
>> > > > wrote:
>> > > > > On 2020-03-28 23:11, Paul Moore wrote:
>> > > > > > On Tue, Mar 24, 2020 at 5:02 PM Richard Guy Briggs 
>> > > > > >  wrote:
>> > > > > > > On 2020-03-23 20:16, Paul Moore wrote:
>> > > > > > > > On Thu, Mar 19, 2020 at 6:03 PM Richard Guy Briggs 
>> > > > > > > >  wrote:
>> > > > > > > > > On 2020-03-18 18:06, Paul Moore wrote:
>
> ...
>
>> > > Well, every time a record gets generated, *any* record gets generated,
>> > > we'll need to check for which audit daemons this record is in scope and
>> > > generate a different one for each depending on the content and whether
>> > > or not the content is influenced by the scope.
>> >
>> > That's the problem right there - we don't want to have to generate a
>> > unique record for *each* auditd on *every* record.  That is a recipe
>> > for disaster.
>> >
>> > Solving this for all of the known audit records is not something we
>> > need to worry about in depth at the moment (although giving it some
>> > casual thought is not a bad thing), but solving this for the audit
>> > container ID information *is* something we need to worry about right
>> > now.
>>
>> If you think that a different nested contid value string per daemon is
>> not acceptable, then we are back to issuing a record that has only *one*
>> contid listed without any nesting information.  This brings us back to
>> the original problem of keeping *all* audit log history since the boot
>> of the machine to be able to track the nesting of any particular contid.
>
> I'm not ruling anything out, except for the "let's just completely
> regenerate every record for each auditd instance".

Paul I am a bit confused about what you are referring to when you say
regenerate every record.

Are you saying that you don't want to repeat the sequence:
audit_log_start(...);
audit_log_format(...);
audit_log_end(...);
for every nested audit daemon?

Or are you saying that you would like to literraly want to send the same
skb to each of the nested audit daemons?

Or are you thinking of something else?

Eric


--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit



Re: [PATCH ghak90 V6 02/10] audit: add container id

2019-07-19 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> Implement the proc fs write to set the audit container identifier of a
> process, emitting an AUDIT_CONTAINER_OP record to document the event.
>
> This is a write from the container orchestrator task to a proc entry of
> the form /proc/PID/audit_containerid where PID is the process ID of the
> newly created task that is to become the first task in a container, or
> an additional task added to a container.
>
> The write expects up to a u64 value (unset: 18446744073709551615).
>
> The writer must have capability CAP_AUDIT_CONTROL.
>
> This will produce a record such as this:
>   type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set 
> opid=2209 contid=123456 old-contid=18446744073709551615 pid=628 auid=root 
> uid=root tty=ttyS0 ses=1 
> subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=bash 
> exe=/usr/bin/bash res=yes
>
> The "op" field indicates an initial set.  The "pid" to "ses" fields are
> the orchestrator while the "opid" field is the object's PID, the process
> being "contained".  New and old audit container identifier values are
> given in the "contid" fields, while res indicates its success.
>
> It is not permitted to unset the audit container identifier.
> A child inherits its parent's audit container identifier.

Why get proc involved in this?  I know it more or less fits as
this is about a process and it's descendants.  But this seems to
encouarge being able to read this value, and being able to read
this value seems to encourage misuse.

So I am not of fan of using proc for this.

> Please see the github audit kernel issue for the main feature:
>   https://github.com/linux-audit/audit-kernel/issues/90
> Please see the github audit userspace issue for supporting additions:
>   https://github.com/linux-audit/audit-userspace/issues/51
> Please see the github audit testsuiite issue for the test case:
>   https://github.com/linux-audit/audit-testsuite/issues/64
> Please see the github audit wiki for the feature overview:
>   https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
>
> Signed-off-by: Richard Guy Briggs 
> Acked-by: Serge Hallyn 
> Acked-by: Steve Grubb 
> Acked-by: Neil Horman 
> Reviewed-by: Ondrej Mosnacek 
> ---
>  fs/proc/base.c | 36 
>  include/linux/audit.h  | 25 +
>  include/uapi/linux/audit.h |  2 ++
>  kernel/audit.c | 69 
> ++
>  kernel/audit.h |  1 +
>  kernel/auditsc.c   |  4 +++
>  6 files changed, 137 insertions(+)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ddef482f1334..43fd0c4b87de 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1294,6 +1294,40 @@ static ssize_t proc_sessionid_read(struct file * file, 
> char __user * buf,
>   .read   = proc_sessionid_read,
>   .llseek = generic_file_llseek,
>  };
> +
> +static ssize_t proc_contid_write(struct file *file, const char __user *buf,
> +size_t count, loff_t *ppos)
> +{
> + struct inode *inode = file_inode(file);
> + u64 contid;
> + int rv;
> + struct task_struct *task = get_proc_task(inode);
> +
> + if (!task)
> + return -ESRCH;
> + if (*ppos != 0) {
> + /* No partial writes. */
> + put_task_struct(task);
> + return -EINVAL;
> + }
> +
> + rv = kstrtou64_from_user(buf, count, 10, &contid);
> + if (rv < 0) {
> + put_task_struct(task);
> + return rv;
> + }
> +
> + rv = audit_set_contid(task, contid);
> + put_task_struct(task);
> + if (rv < 0)
> + return rv;
> + return count;
> +}
> +
> +static const struct file_operations proc_contid_operations = {
> + .write  = proc_contid_write,
> + .llseek = generic_file_llseek,
> +};
>  #endif
>  
>  #ifdef CONFIG_FAULT_INJECTION
> @@ -3033,6 +3067,7 @@ static int proc_stack_depth(struct seq_file *m, struct 
> pid_namespace *ns,
>  #ifdef CONFIG_AUDIT
>   REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
>   REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> + REG("audit_containerid", S_IWUSR, proc_contid_operations),
>  #endif
>  #ifdef CONFIG_FAULT_INJECTION
>   REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> @@ -3431,6 +3466,7 @@ static int proc_tid_comm_permission(struct inode 
> *inode, int mask)
>  #ifdef CONFIG_AUDIT
>   REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
>   REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> + REG("audit_containerid", S_IWUSR, proc_contid_operations),
>  #endif
>  #ifdef CONFIG_FAULT_INJECTION
>   REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index bde346e73f0c..301337776193 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/au

Re: [PATCH ghak90 V6 03/10] audit: read container ID of a process

2019-07-19 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> Add support for reading the audit container identifier from the proc
> filesystem.
>
> This is a read from the proc entry of the form
> /proc/PID/audit_containerid where PID is the process ID of the task
> whose audit container identifier is sought.
>
> The read expects up to a u64 value (unset: 18446744073709551615).
>
> This read requires CAP_AUDIT_CONTROL.

This scares me.As this seems to make it easy to reuse an audit
containerid for non-audit purporses.

I would think it would be safer and easier to poke audit and ask it to
log a message with your audit container id.

Eric


> Signed-off-by: Richard Guy Briggs 
> Acked-by: Serge Hallyn 
> Acked-by: Neil Horman 
> Reviewed-by: Ondrej Mosnacek 
> ---
>  fs/proc/base.c | 25 ++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 43fd0c4b87de..acc70239d0cb 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1211,7 +1211,7 @@ static ssize_t oom_score_adj_write(struct file *file, 
> const char __user *buf,
>  };
>  
>  #ifdef CONFIG_AUDIT
> -#define TMPBUFLEN 11
> +#define TMPBUFLEN 21
>  static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
> size_t count, loff_t *ppos)
>  {
> @@ -1295,6 +1295,24 @@ static ssize_t proc_sessionid_read(struct file * file, 
> char __user * buf,
>   .llseek = generic_file_llseek,
>  };
>  
> +static ssize_t proc_contid_read(struct file *file, char __user *buf,
> +   size_t count, loff_t *ppos)
> +{
> + struct inode *inode = file_inode(file);
> + struct task_struct *task = get_proc_task(inode);
> + ssize_t length;
> + char tmpbuf[TMPBUFLEN];
> +
> + if (!task)
> + return -ESRCH;
> + /* if we don't have caps, reject */
> + if (!capable(CAP_AUDIT_CONTROL))
> + return -EPERM;
> + length = scnprintf(tmpbuf, TMPBUFLEN, "%llu", audit_get_contid(task));
> + put_task_struct(task);
> + return simple_read_from_buffer(buf, count, ppos, tmpbuf, length);
> +}
> +
>  static ssize_t proc_contid_write(struct file *file, const char __user *buf,
>  size_t count, loff_t *ppos)
>  {
> @@ -1325,6 +1343,7 @@ static ssize_t proc_contid_write(struct file *file, 
> const char __user *buf,
>  }
>  
>  static const struct file_operations proc_contid_operations = {
> + .read   = proc_contid_read,
>   .write  = proc_contid_write,
>   .llseek = generic_file_llseek,
>  };
> @@ -3067,7 +3086,7 @@ static int proc_stack_depth(struct seq_file *m, struct 
> pid_namespace *ns,
>  #ifdef CONFIG_AUDIT
>   REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
>   REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> - REG("audit_containerid", S_IWUSR, proc_contid_operations),
> + REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
>  #endif
>  #ifdef CONFIG_FAULT_INJECTION
>   REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> @@ -3466,7 +3485,7 @@ static int proc_tid_comm_permission(struct inode 
> *inode, int mask)
>  #ifdef CONFIG_AUDIT
>   REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
>   REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> - REG("audit_containerid", S_IWUSR, proc_contid_operations),
> + REG("audit_containerid", S_IWUSR|S_IRUSR, proc_contid_operations),
>  #endif
>  #ifdef CONFIG_FAULT_INJECTION
>   REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),


Re: [PATCH ghak90 V6 02/10] audit: add container id

2019-07-19 Thread Eric W. Biederman
Paul Moore  writes:

> On Wed, Jul 17, 2019 at 8:52 PM Richard Guy Briggs  wrote:
>> On 2019-07-16 19:30, Paul Moore wrote:
>
> ...
>
>> > We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit container
>> > ID policy, we can not trust ns_capable(CAP_AUDIT_CONTROL).
>>
>> Ok.  So does a process in a non-init user namespace have two (or more)
>> sets of capabilities stored in creds, one in the init_user_ns, and one
>> in current_user_ns?  Or does it get stripped of all its capabilities in
>> init_user_ns once it has its own set in current_user_ns?  If the former,
>> then we can use capable().  If the latter, we need another mechanism, as
>> you have suggested might be needed.
>
> Unfortunately I think the problem is that ultimately we need to allow
> any container orchestrator that has been given privileges to manage
> the audit container ID to also grant that privilege to any of the
> child process/containers it manages.  I don't believe we can do that
> with capabilities based on the code I've looked at, and the
> discussions I've had, but if you find a way I would leave to hear it.

>> If some random unprivileged user wants to fire up a container
>> orchestrator/engine in his own user namespace, then audit needs to be
>> namespaced.  Can we safely discard this scenario for now?
>
> I think the only time we want to allow a container orchestrator to
> manage the audit container ID is if it has been granted that privilege
> by someone who has that privilege already.  In the zero-container, or
> single-level of containers, case this is relatively easy, and we can
> accomplish it using CAP_AUDIT_CONTROL as the privilege.  If we start
> nesting container orchestrators it becomes more complicated as we need
> to be able to support granting and inheriting this privilege in a
> manner; this is why I suggested a new mechanism *may* be necessary.


Let me segway a bit and see if I can get this conversation out of the
rut it seems to have drifted into.

Unprivileged containers and nested containers exist today and are going
to become increasingly common.  Let that be a given.

As I recall the interesting thing for audit to log is actions by
privileged processes.  Audit can log more but generally configuring
logging by of the actions of unprivileged users is effectively a self
DOS.

So I think the initial implementation can safely ignore actions of
nested containers and unprivileged containers because you don't care
about their actions. 

If we start allow running audit in a container then we need to deal with
all of the nesting issues but until then I don't think you folks care.

Or am I wrong.  Do the requirements for securely auditing things from
the kernel care about the actions of unprivileged users?

Eric


Re: [PATCH ghak90 V6 02/10] audit: add container id

2019-07-19 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> On 2019-07-16 19:30, Paul Moore wrote:
>> On Tue, Jul 16, 2019 at 6:03 PM Richard Guy Briggs  wrote:
>> > On 2019-07-15 17:04, Paul Moore wrote:
>> > > On Mon, Jul 8, 2019 at 2:06 PM Richard Guy Briggs  
>> > > wrote:
>> 
>> > > > At this point I would say we are at an impasse unless we trust
>> > > > ns_capable() or we implement audit namespaces.
>> > >
>> > > I'm not sure how we can trust ns_capable(), but if you can think of a
>> > > way I would love to hear it.  I'm also not sure how namespacing audit
>> > > is helpful (see my above comments), but if you think it is please
>> > > explain.
>> >
>> > So if we are not namespacing, why do we not trust capabilities?
>> 
>> We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit container
>> ID policy, we can not trust ns_capable(CAP_AUDIT_CONTROL).
>
> Ok.  So does a process in a non-init user namespace have two (or more)
> sets of capabilities stored in creds, one in the init_user_ns, and one
> in current_user_ns?  Or does it get stripped of all its capabilities in
> init_user_ns once it has its own set in current_user_ns?  If the former,
> then we can use capable().  If the latter, we need another mechanism, as
> you have suggested might be needed.

The latter.  There is only one set of capabilities and it is in the
processes current user namespace.

Eric


Re: [PATCH ghak111 V1] audit: deliver siginfo regarless of syscall

2019-04-09 Thread Eric W. Biederman


Minor nit about the description of this patch (as I presume it will need
to respun).

You are talking about audit signal information.  You are not talking
about struct siginfo (aka siginfo_t).  The structure siginfo_t is part
of posix and userspace ABI and is part of the stack frame for a
delivered signal.

The summary had me thinking you were messing with when siginfo is
collected and delivered to userspace.

Richard Guy Briggs  writes:

> On 2019-04-09 17:37, Steve Grubb wrote:
>> On Tue, 9 Apr 2019 10:02:59 -0400
>> Richard Guy Briggs  wrote:
>> 
>> > On 2019-04-09 08:01, Steve Grubb wrote:
>> > > On Mon,  8 Apr 2019 23:52:29 -0400 Richard Guy Briggs
>> > >  wrote:  
>> > > > When a process signals the audit daemon (shutdown, rotate, resume,
>> > > > reconfig) but syscall auditing is not enabled, we still want to
>> > > > know the identity of the process sending the signal to the audit
>> > > > daemon.  
>> > > 
>> > > Why? If syscall auditing is disabled, then there is no requirement
>> > > to provide anything. What is the real problem that you are seeing?  
>> > 
>> > Shutdown messages with -1 in them rather than the real values.
>> 
>> OK. We can fix that by patching auditd to see if auditing is enabled
>> before requesting signal info. If auditing is disabled, the proper
>> action is for the kernel to ignore any audit userspace messages except
>> the configuration commands.
>
> If auditing is disabled in the kernel, none of this is trackable.  It is
> for those as yet unsupported arches that can run audit enabled but
> without auditsyscall support.
>
> Here's a current sample with CONFIG_AUDIT and ~CONFIG_AUDITSYSCALL
> configured, note "auid=" and "pid=":
>
>   killall -HUP auditd
>   type=DAEMON_CONFIG msg=audit(2019-04-09 11:37:04.508:3266) 
> op=reconfigure state=changed auid=unset pid=-1 subj=? res=success
>
>   killall -TERM auditd
>   type=DAEMON_END msg=audit(2019-04-09 11:51:50.441:5709) : op=terminate 
> auid=unset pid=-1 subj=? res=success 
>   
> and the same with the patch applied:
>
>   killall -HUP auditd
>   type=DAEMON_CONFIG msg=audit(2019-04-09 11:42:40.851:8924) 
> op=reconfigure state=changed auid=root pid=652 subj=? res=success
>
>   killall -TERM auditd
>   type=DAEMON_END msg=audit(2019-04-09 11:51:50.441:5709) : op=terminate 
> auid=root pid=652 subj=? res=success 
>
> The USR1 "rotate" and USR2 "resume" log messages need to be fixed in
> userspace.

Hmm.  You mention -1 as beeing a problem.  You don't say if auid is a
concern.

It looks like all you care about is the sending processes pid.  That
information is saved in the sending processes siginfo.  As such you may
be able to remove some of the magic from the code by looking at the
siginfo of the signal.

Why it appears the kernel is logging when userspace receives a signal is
beyond me so I don't know using siginfo makes sense.  I just figure I
will toss it out there in case it shakes some ideas loose.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [RFC PATCH ghak32 V2 13/13] debug audit: read container ID of a process

2018-05-21 Thread Eric W. Biederman
Steve Grubb  writes:

> On Friday, March 16, 2018 5:00:40 AM EDT Richard Guy Briggs wrote:
>> Add support for reading the container ID from the proc filesystem.
>
> I think this could be useful in general. Please consider this to be part of 
> the full patch set and not something merely used to debug the patches.

Only with an audit specific name.

As it is:

Nacked-by: "Eric W. Biederman" 

The truth is the containerid name really stinks and is quite confusing
and does not imply that the label applies only to audit.  And little
things like this make me extremely uncofortable with it.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: RFC(V3): Audit Kernel Container IDs

2018-01-09 Thread Eric W. Biederman

Please let's have a description of the problem you are trying to solve.

A proposed solution without talking about the problem space is useless.
Any proposed solution could potentially work.

I know to these exist.  There is motivation for your work.
What is the motivation?
What problem are you trying to solve?

In particular what information are you trying to get into logs that you
can not get into the logs today?

I am going to try to give this the attention it deserves but right now I
am having to deal with half thought out patches for information leaks
from speculative code paths, so I won't be able to give this much
attention for a little bit.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: RFC(v2): Audit Kernel Container IDs

2017-10-19 Thread Eric W. Biederman
Paul Moore  writes:

> On Wed, Oct 18, 2017 at 8:43 PM, Eric W. Biederman
>  wrote:
>> Aleksa Sarai  writes:
>>>>> The security implications are that anything that can change the label
>>>>> could also hide itself and its doings from the audit system and thus
>>>>> would be used as a means to evade detection.  I actually think this
>>>>> means the label should be write once (once you've set it, you can't
>>>>> change it) ...
>>>>
>>>> Richard and I have talked about a write once approach, but the
>>>> thinking was that you may want to allow a nested container
>>>> orchestrator (Why? I don't know, but people always want to do the
>>>> craziest things.) and a write-once policy makes that impossible.  If
>>>> we punt on the nested orchestrator, I believe we can seriously think
>>>> about a write-once policy to simplify things.
>>>
>>> Nested containers are a very widely used use-case (see LXC system 
>>> containers,
>>> inside of which people run other container runtimes). So I would definitely
>>> consider it something that "needs to be supported in some way". While the 
>>> LXC
>>> guys might be a *tad* crazy, the use-case isn't. :P
>
> No worries, we're all a little crazy in our own special ways ;)
>
> Kidding aside, thanks for explaining the use case.
>
>> Of course some of that gets to running auditd inside a container which
>> we don't have yet either.
>>
>> So I think to start it is perfectly fine to figure out the non-nested
>> case first and what makes sense there.  Then to sort out the nested
>> container case.
>>
>> The solution might be that a process gets at most one id per ``audit
>> namespace''.
>
> In an attempt to stay on-topic, let's try to stick with "audit
> container ID" or "container ID" if you must.  I really want to avoid
> the term "audit namespace" simply because the term "namespace" implies
> some things which we aren't planning on doing.

This is 100% on topic.  I am saying that unless we are planing to have
auditd running in a container with it's own set of rules you probably
don't care about nested containers.  Last time I heard a discussion
about that the term in use was audit namespace.   So I was referring to
that support when I said audit namespace, even if the end result only
loosely fits the term namespace.

I could be wrong of course.  I don't fully understand what is driving
the desire to connect audit and containers.  But my naive guess is that
one from an audit perspective you don't care about nested containers
unless there is also a nested auditd who is looking at it from a nested
perspective.

So far we have established with the term container that we are talking
about a running instance of processes, not a filesystem instance that
Docker and friends ship around.   Beyond that I am not certain what you
care about.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: RFC(v2): Audit Kernel Container IDs

2017-10-18 Thread Eric W. Biederman
Aleksa Sarai  writes:

>>> The security implications are that anything that can change the label
>>> could also hide itself and its doings from the audit system and thus
>>> would be used as a means to evade detection.  I actually think this
>>> means the label should be write once (once you've set it, you can't
>>> change it) ...
>>
>> Richard and I have talked about a write once approach, but the
>> thinking was that you may want to allow a nested container
>> orchestrator (Why? I don't know, but people always want to do the
>> craziest things.) and a write-once policy makes that impossible.  If
>> we punt on the nested orchestrator, I believe we can seriously think
>> about a write-once policy to simplify things.
>
> Nested containers are a very widely used use-case (see LXC system containers,
> inside of which people run other container runtimes). So I would definitely
> consider it something that "needs to be supported in some way". While the LXC
> guys might be a *tad* crazy, the use-case isn't. :P

Of course some of that gets to running auditd inside a container which
we don't have yet either.

So I think to start it is perfectly fine to figure out the non-nested
case first and what makes sense there.  Then to sort out the nested
container case.

The solution might be that a process gets at most one id per ``audit
namespace''.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: RFC(v2): Audit Kernel Container IDs

2017-10-12 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> A namespace cannot directly migrate from one container to another but
> could be assigned to a newly spawned container.  A namespace can be
> moved from one container to another indirectly by having that namespace
> used in a second process in another container and then ending all the
> processes in the first container.

Ugh no.  The semantics here are way too mushy.  We need a clean crisp
unambiguous definition or it will be impossible to get this correct and
impossible to use for any security purpose.

I understand the challenge.  Some of the container managers share
namespaces between containers.  Leading to things that are not really
contained.

Please make this concept like an indellibale die.  Once you are stained
with it you can not escape.  If you don't meet all of the criteria you
aren't stained.

The justification that I heard, and that seems legitimate is that it is
not timely and it is hard to make the connection between the distinct
unshare, setns, and clone events and what is happening in the kernel.

With that justification definitely the network namespace needs to be
stained if it is appropriate.

I also don't see why this can't be a special dedicated audit message.
I just looked at the code in the kernel and nlmsg_type is a u16.  There
are only a handful of audit message types defined.  There is absolutely
no reason to bring proc into this.

I have the same reservation as the others about defining a new cap for
this.  It should be enough to make setting the container id a one time
thing for a set of processes and namespaces.

If this is going to be security it needs to be very simple and very well 
defined.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: RFC: Audit Kernel Container IDs

2017-09-19 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> On 2017-09-14 12:33, Eric W. Biederman wrote:
>> Richard Guy Briggs  writes:
>> 
>> > The trigger is a pseudo filesystem (proc, since PID tree already exists)
>> > write of a u64 representing the container ID to a file representing a
>> > process that will become the first process in a new container.
>> > This might place restrictions on mount namespaces required to define a
>> > container, or at least careful checking of namespaces in the kernel to
>> > verify permissions of the orchestrator so it can't change its own
>> > container ID.
>> 
>> Why a u64?
>
> u32 will roll too quickly.  UUID is large enough that it adds
> significantly to audit record bandwidth.  I'd prefer u64, but can look
> at the difference of accommodating a UUID...

I was imagining a string might be better.  As for the purposes of audit
it is just a byte string you regurgitate.

>> Why a proc filesystem write and not a magic audit message?
>
> A magic audit message requires CAP_AUDIT_WRITE, which we'd like to use
> sparingly.  Given that orchestrators will already require it to send
> the mandatory AUDIT_VIRT_*, this doesn't seem like an unreasonable burden.
>
> I was originally leaning towards an audit message trigger or a syscall.
>
>> I don't like the fact that the proc filesystem entry is likely going to
>> be readable and abusable by non-audit contexts?
>
> This proposal wasn't going to start with that link being readable, but
> its filesystem structure and link names would be, perhaps giving away
> too much already.
>
> I think we will need to find a way for the orchestrator or one of its
> authorized agents to read this information while blocking reads from
> unauthorized agents, otherwise this would be of very limited use.

Something that is set only for future audit messages seems reasonable.
Once you start reading this from something other than audit messages I
get neverous, that people will use this beyond audit for things it is
not intended for.

>> Why the ability to change the containerid?  What is the use case you are
>> thinking of there?
>
> This was covered in the end of the conversation with Paul Moore (that
> maybe you got tired reading?)

I have not had time to review everything.  As I was busy preparing for my
wedding and am now in the middle of my honeymoon.

> I'd originally proposed having it write
> once, but Paul figured there was no good reason to restrict it and leave
> that decision up to the orchestrator.  The use case would be adding
> other processes to a container, but it could be argued all additional
> processes should be spawned by the first process in a container.

I see two cases here:
a) Nested containers
b) Inject processes via something like nsenter into a container.

In case a) you have to figure out what to do with nested containers
and that does seem to be a legitimate case for a double write.  Arguably
with the restriction that you must specify a more nested label.

In case b) which you seem to be referring to it would be a process
created by the container manager outside the container that has no
container label.  At which point there is not a need for a double write.

So my recommendation is to not support double writes until you support
nested containers.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: RFC: Audit Kernel Container IDs

2017-09-14 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> The trigger is a pseudo filesystem (proc, since PID tree already exists)
> write of a u64 representing the container ID to a file representing a
> process that will become the first process in a new container.
> This might place restrictions on mount namespaces required to define a
> container, or at least careful checking of namespaces in the kernel to
> verify permissions of the orchestrator so it can't change its own
> container ID.

Why a u64?

Why a proc filesystem write and not a magic audit message?
I don't like the fact that the proc filesystem entry is likely going to
be readable and abusable by non-audit contexts?

Why the ability to change the containerid?  What is the use case you are
thinking of there?

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH 6/6 RFC] netfilter: add audit netns ID

2017-05-24 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> On 2017-05-24 19:31, Pablo Neira Ayuso wrote:
>> Cc'ing Eric Biederman.
>> 
>> On Thu, May 18, 2017 at 01:21:52PM -0400, Richard Guy Briggs wrote:
>> > diff --git a/net/bridge/netfilter/ebtables.c 
>> > b/net/bridge/netfilter/ebtables.c
>> > index 59b63a8..0f77b2a 100644
>> > --- a/net/bridge/netfilter/ebtables.c
>> > +++ b/net/bridge/netfilter/ebtables.c
>> > @@ -27,6 +27,7 @@
>> >  #include 
>> >  #include 
>> >  #include 
>> > +#define PROC_DYNAMIC_FIRST 0xF000U
>> >  #include 
>> >  /* needed for logical [in,out]-dev filtering */
>> >  #include "../br_private.h"
>> > @@ -1075,7 +1076,8 @@ static int do_replace_finish(struct net *net, struct 
>> > ebt_replace *repl,
>> >ab = audit_log_start(current->audit_context, GFP_KERNEL,
>> > AUDIT_NETFILTER_CFG);
>> >if (ab) {
>> > -  audit_log_format(ab, "op=replace family=%u 
>> > table=%s entries=%u",
>> > +  audit_log_format(ab, "op=replace net=%u 
>> > family=%u table=%s entries=%u",
>> > +   net->ns.inum - 
>> > PROC_DYNAMIC_FIRST,
>> 
>> IIRC, there was a discussion on exposing netns i-node number to
>> userspace time ago on netdev and Eric Biederman was not happy about
>> this?
>
> He was not happy about it being exposed in the /proc filesystem.  We've
> been talking since then and while we've not come to a definitive
> conclusion there is a communication channel open.
>
> This is more of an RFC patch than the rest of this set and I didn't
> seriously expect this one to be accepted, I did want to present the idea
> to see if there were concerns or better ideas generated how to
> differentiate this record from a seemingly identical one.  The only
> other ID would be the network namespace' struct pointer.
>
> At this stage, one thing that is missing is a device number to qualify
> this namespace ID.
>
> Once I started printing the namespace proc inode number (minus the
> starting offset) in decimal, it was very clear what was happenning and
> seemed worth sharing that debugging tool patch.

If the appropriate device number and full inode number is included I
don't have any deep problems with the idea.  I don't like the bare inode
number as we have had in the past and may in the future have these inode
numbers in multiple filesystems so the inode number by itself is not
unique.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH v4 0/3] perf: add support for analyzing events for containers

2017-01-11 Thread Eric W. Biederman
Aravinda Prasad  writes:

> On Wednesday 04 January 2017 02:34 PM, Krister Johansen wrote:
>> On Tue, Jan 03, 2017 at 04:57:54PM +0530, Hari Bathini wrote:
>>> On Thursday 29 December 2016 07:11 AM, Krister Johansen wrote:
 On Fri, Dec 16, 2016 at 12:06:55AM +0530, Hari Bathini wrote:
> This patch-set overcomes this limitation by using cgroup identifier as
> container unique identifier. A new PERF_RECORD_NAMESPACES event that
> records namespaces related info is introduced, from which the cgroup
> namespace's device & inode numbers are used as cgroup identifier. This
> is based on the assumption that each container is created with it's own
> cgroup namespace allowing assessment/analysis of multiple containers
> using cgroup identifier.
 Why choose cgroups when the kernel dispenses namespace-unique
 identifiers. Cgroup membership can be arbitrary.  Moreover, cgroup and
>>>
>>> Agreed. But doesn't that hold for any other namespace or a combination
>>> of namespaces as well?
>> 
>> I guess that's part of my concern.  There is no container-unique
>> identifier on the system, since the notion of containers is a construct
>> of higer-level software.  
>
> I wish we had a container-unique identifier. A container-unique
> identifier will make things a lot more better, not just for
> container-aware tracing but for audit subsystem as well.
>
> https://lwn.net/Articles/699819/#Comments

Something like the audit login id might be useful for some things, but
don't expect it to cover all containers, or all usecases so something
like that would need a more specific name than container id.

Any such identifier needs to handle the case of nested containers, and
of container migration from one system to another.

The issue generally appears to be that we have plent of ids that can
serve that purpose but there is insufficient agreement on what
constitues a container.

So at this point just no.  You quoted my technical description of why
there does not exist such a thing as a container id, and have completely
ignored the technical reasons.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH] prctl: remove one-shot limitation for changing exe link

2016-08-01 Thread Eric W. Biederman
Mateusz Guzik  writes:

> On Sat, Jul 30, 2016 at 12:31:40PM -0500, Eric W. Biederman wrote:
>> So what I am requesting is very simple.  That the checks in
>> prctl_set_mm_exe_file be tightened up to more closely approach what
>> execve requires.  Thus preserving the value of the /proc/[pid]/exe for
>> the applications that want to use the exe link.
>> 
>> Once the checks in prctl_set_mm_exe_file are tightened up please feel
>> free to remove the one shot test.
>> 
>
> This is more fishy.
>
> First of all exe_file is used by the audit subsystem. So someone has to
> ask audit people what is the significance (if any) of the field.
>
> All exe_file users but one use get_mm_exe_file and handle NULL
> gracefully.
>
> Even with the current limit of changing the field once, the user can
> cause a transient failure of get_mm_exe_file which can fail to increment
> the refcount before it drops to 0.
>
> This transient failure can be used to get a NULL value stored in
> ->exe_file during fork (in dup_mmap):
> RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
>
> The one place which is not using get_mm_exe_file to get to the pointer
> is audit_exe_compare:
> rcu_read_lock();
> exe_file = rcu_dereference(tsk->mm->exe_file);
> ino = exe_file->f_inode->i_ino;
> dev = exe_file->f_inode->i_sb->s_dev;
> rcu_read_unlock();
>
> This is buggy on 2 accounts:
> 1. exe_file can be NULL
> 2. rcu does not protect f_inode
>
> The issue is made worse with allowing arbitrary number changes.
>
> Modifying get_mm_exe_file to retry is trivial and in effect never return
> NULL is trivial. With arbitrary number of changes allowed this may
> require some cond_resched() or something.
>
> For comments I cc'ed Richard Guy Briggs, who is both an audit person and
> the author of audit_exe_compare.

That is fair.  Keeping the existing users working is what needs to
happen.

At the same time we have an arbitrary number of possible changes with
exec, but I guess that works differently because the mm is changed as
well.

So yes let's bug fix this piece of code and then we can see about
relaxing constraints.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH] capabilities: audit capability use

2016-07-12 Thread Eric W. Biederman
Topi Miettinen  writes:

> On 07/11/16 21:57, Eric W. Biederman wrote:
>> Topi Miettinen  writes:
>> 
>>> There are many basic ways to control processes, including capabilities,
>>> cgroups and resource limits. However, there are far fewer ways to find
>>> out useful values for the limits, except blind trial and error.
>>>
>>> Currently, there is no way to know which capabilities are actually used.
>>> Even the source code is only implicit, in-depth knowledge of each
>>> capability must be used when analyzing a program to judge which
>>> capabilities the program will exercise.
>>>
>>> Generate an audit message at system call exit, when capabilities are used.
>>> This can then be used to configure capability sets for services by a
>>> software developer, maintainer or system administrator.
>>>
>>> Test case demonstrating basic capability monitoring with the new
>>> message types 1330 and 1331 and how the cgroups are displayed (boot to
>>> rdshell):
>> 
>> You totally miss the interactions with the user namespace so this won't
>> give you the information you are aiming for.
>
> Please correct me if this is not right:
>
> There are two cases:
> a) real capability use as seen outside the namespace
> b) use of capabilities granted by the namespace
> Both cases could be active independently.
>
> For auditing purposes, we're  mostly interested in a) and log noise from
> b) could be even seen a distraction.
>
> For configuration purposes, both cases can be interesting, a) for the
> configuration of  services and b) in case where the containerized
> configuration is planned to be deployed outside. I'd still only log
> a).
>
>
> The same logic should apply with cgroup namespaces.

Not logging capabilities outside of the initial user namespace is
certainly the conservative place to start, and what selinux does.

You should also be logging capability use from cap_capable.  Not
ns_capable.  You are missing several kinds of capability use as
a quick review of kernel/capability.c should have shown you.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH] capabilities: audit capability use

2016-07-11 Thread Eric W. Biederman
Topi Miettinen  writes:

> There are many basic ways to control processes, including capabilities,
> cgroups and resource limits. However, there are far fewer ways to find
> out useful values for the limits, except blind trial and error.
>
> Currently, there is no way to know which capabilities are actually used.
> Even the source code is only implicit, in-depth knowledge of each
> capability must be used when analyzing a program to judge which
> capabilities the program will exercise.
>
> Generate an audit message at system call exit, when capabilities are used.
> This can then be used to configure capability sets for services by a
> software developer, maintainer or system administrator.
>
> Test case demonstrating basic capability monitoring with the new
> message types 1330 and 1331 and how the cgroups are displayed (boot to
> rdshell):

You totally miss the interactions with the user namespace so this won't
give you the information you are aiming for.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH V6 05/10] audit: log creation and deletion of namespace instances

2015-05-16 Thread Eric W. Biederman
Paul Moore  writes:

> On Sat, May 16, 2015 at 5:46 AM, Daniel J Walsh  wrote:
>> On 05/15/2015 05:05 PM, Paul Moore wrote:
>>> On Thursday, May 14, 2015 11:23:09 PM Andy Lutomirski wrote:
 On Thu, May 14, 2015 at 7:32 PM, Richard Guy Briggs  
 wrote:
> On 15/05/14, Paul Moore wrote:
>> * Look at our existing audit records to determine which records should
>> have
>> namespace and container ID tokens added.  We may only want to add the
>> additional fields in the case where the namespace/container ID tokens are
>> not the init namespace.
> If we have a record that ties a set of namespace IDs with a container
> ID, then I expect we only need to list the containerID along with auid
> and sessionID.
 The problem here is that the kernel has no concept of a "container", and I
 don't think it makes any sense to add one just for audit.  "Container" is a
 marketing term used by some userspace tools.

 I can imagine that both audit could benefit from a concept of a
 namespace *path* that understands nesting (e.g. root/2/5/1 or
 something along those lines).  Mapping these to "containers" belongs
 in userspace, I think.
>>> It might be helpful to climb up a few levels in this thread ...
>>>
>>> I think we all agree that containers are not a kernel construct.  I further
>>> believe that the kernel has no business generating container IDs, those 
>>> should
>>> come from userspace and will likely be different depending on how you define
>>> "container".  However, what is less clear to me at this point is how the
>>> kernel should handle the setting, reporting, and general management of this
>>> container ID token.
>>>
>> Wouldn't the easiest thing be to just treat add a containerid to the
>> process context like auid.
>
> I believe so.  At least that was the point I was trying to get across
> when I first jumped into this thread.

It sounds nice but containers are not just a per process construct.
Sometimes you might know anamespace but not which process instigated
action to happen on that namespace.

>> Then make it a privileged operation to set it.  Then tools that care about
>> auditing like docker can set the ID
>> and remove the Capability from it sub processes if it cares.  All
>> processes adopt parent processes containerid.
>> Now containers can be audited and as long as userspace is written
>> correctly nested containers can either override the containerid or not
>> depending on what the audit rules are.
>
> This part I'm still less certain on.  I agree that setting the
> container ID should be privileged in some sense, but the kernel
> shouldn't *require* privilege to create a new container (however the
> user chooses to define it).  Simply requiring privilege to set the
> container ID and failing silently may be sufficient.

My hope is as things mature fewer and fewer container things will need
any special privilege to create.

I think it needs to start with a clear definition of what is wanted and
then working backwards through which messages in which contexts you want
to have your magic bits.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH V6 05/10] audit: log creation and deletion of namespace instances

2015-05-15 Thread Eric W. Biederman
Steve Grubb  writes:

> On Thursday, May 14, 2015 08:31:45 PM Eric W. Biederman wrote:
>> Paul Moore  writes:
>> > As Eric, and others, have stated, the container concept is a userspace
>> > idea, not a kernel idea; the kernel only knows, and cares about,
>> > namespaces.  This is unlikely to change.
>> > 
>> > However, as Steve points out, there is precedence for the kernel to record
>> > userspace tokens for the sake of audit.  Personally I'm not a big fan of
>> > this in general, but I do recognize that it does satisfy a legitimate
>> > need.  Think of things like auid and the sessionid as necessary evils;
>> > audit is already chock full of evilness I doubt one more will doom us all
>> > to hell.
>> > 
>> > Moving forward, I'd like to see the following:
>> > 
>> > * Create a container ID token (unsigned 32-bit integer?), similar to
>> > auid/sessionid, that is set by userspace and carried by the kernel to be
>> > used in audit records.  I'd like to see some discussion on how we manage
>> > this, e.g. how do handle container ID inheritance, how do we handle
>> > nested containers (setting the containerid when it is already set), do we
>> > care if multiple different containers share the same namespace config,
>> > etc.?
>> > 
>> > Can we all live with this?  If not, please suggest some alternate ideas;
>> > simply shouting "IT'S ALL CRAP!" isn't helpful for anyone ... it may be
>> > true, but it doesn't help us solve the problem ;)
>> 
>> Without stopping and defining what someone means by container I think it
>> is pretty much nonsense.
>
> Maybe this is what's hanging everyone up? Its easy to get lost when your view 
> is down at the syscall level and what is happening in the kernel. Starting a 
> container is akin to the idea of login. Not every call to setresuid is a 
> login. It could be a setuid program starting or a daemon dropping privileges. 
> The idea of a container is a higher level concept that starting a name space. 
> I think comparing a login with a container is a useful analogy because both 
> are higher level concepts but employ low level ideas. A login is a collection 
> of chdir, setuid, setgid, allocating a tty, associating the first 3 file 
> descriptors, setting a process group, and starting a specific executable. All 
> these low level concepts each by itself is not special.

Except login and setresuid are privileged operation.

CREATING A CONTAINER IS NOT A PRIVILGED OPERATION.
Your analagy fails rather badly with respect to that fact.

> A container is what we need auditing events around not creation of 
> namespaces. 
> If we want creation of namespaces, we can audit the clone/unshare/setns 
> syscalls. The container is when a managing program such as docker, lxc, or 
> sometimes systemd creates a special operating environment for the express 
> purpose of running programs disassociated in some way from the parent 
> namespaces, cgroups, and security assumptions. Its this orchestration, just 
> as 
> sshd orchestrates a login, that makes it different.

What do you define as a container?  From what I can tell we share
a similiar understanding of the term, and running lxc is not a
privileged operation.  Running sandstorm.io is not a privileged
operation.

>> Should every vsftp connection get a container every?  Every chrome tab?
>
> No. Also, note that not every program that grants a user session constitutes 
> a 
> login.

>> At some of the connections per second numbers I have seen we might
>> exhaust a 32bit number in an hour or two.  Will any of that make sense
>> to someone reading the audit logs?
>
> I would agree if we were auditing creation of name spaces. But going back to 
> the concept of login, these could occur at a high rate. This is a bruteforce 
> login attack. We put countermeasures in place to prevent it. But it is 
> possible for the session id to wrap. But in our case, things like lxc or 
> docker don't start hundreds of these a minute.

Except there are reasonable situtations where container creation does
happen at fast rates.  Outside of a container per network connection
(which is likely to happen at some point) I have seen builds fire up
more containers than I can count as part of automated testing.

>> Without considerning that container creation is an unprivileged
>> operation I think it is pretty much nonsense.  Do I get to say I am any
>> container I want?  That would seem to invalidate the concept of
>> userspace setting a container id.
>
> It would need to be a privileged operation just as setuid is.

CONTAINER CR

Re: [PATCH V6 05/10] audit: log creation and deletion of namespace instances

2015-05-15 Thread Eric W. Biederman
Paul Moore  writes:
> As Eric, and others, have stated, the container concept is a userspace idea, 
> not a kernel idea; the kernel only knows, and cares about, namespaces.  This 
> is unlikely to change.
>
> However, as Steve points out, there is precedence for the kernel to record 
> userspace tokens for the sake of audit.  Personally I'm not a big fan of this 
> in general, but I do recognize that it does satisfy a legitimate need.  Think 
> of things like auid and the sessionid as necessary evils; audit is already 
> chock full of evilness I doubt one more will doom us all to hell.
>
> Moving forward, I'd like to see the following:

> * Create a container ID token (unsigned 32-bit integer?), similar to 
> auid/sessionid, that is set by userspace and carried by the kernel to be used 
> in audit records.  I'd like to see some discussion on how we manage this, 
> e.g. 
> how do handle container ID inheritance, how do we handle nested containers 
> (setting the containerid when it is already set), do we care if multiple 
> different containers share the same namespace config, etc.?


> Can we all live with this?  If not, please suggest some alternate ideas; 
> simply shouting "IT'S ALL CRAP!" isn't helpful for anyone ... it may be true, 
> but it doesn't help us solve the problem ;)

Without stopping and defining what someone means by container I think it
is pretty much nonsense.

Should every vsftp connection get a container every?  Every chrome tab?

At some of the connections per second numbers I have seen we might
exhaust a 32bit number in an hour or two.  Will any of that make sense
to someone reading the audit logs?

Without considerning that container creation is an unprivileged
operation I think it is pretty much nonsense.  Do I get to say I am any
container I want?  That would seem to invalidate the concept of
userspace setting a container id.

How does any of this interact with setns?  AKA entering a container?

I will go as far as looking at patches.  If someone comes up with
a mission statement about what they are actually trying to achieve and a
mechanism that actually achieves that, and that allows for containers to
nest we can talk about doing something like that.

But for right now I just hear proposals for things that make no sense
and can not possibly work.  Not least because it will require modifying
every program that creates a container and who knows how many of them
there are.  Especially since you don't need to be root.  Modifying
/usr/bin/unshare seems a little far out to me.

Eric




--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH V6 05/10] audit: log creation and deletion of namespace instances

2015-05-14 Thread Eric W. Biederman
Steve Grubb  writes:

> On Tuesday, May 12, 2015 03:57:59 PM Richard Guy Briggs wrote:
>> On 15/05/05, Steve Grubb wrote:
>> > I think there needs to be some more discussion around this. It seems like
>> > this is not exactly recording things that are useful for audit.
>> 
>> It seems to me that either audit has to assemble that information, or
>> the kernel has to do so.  The kernel doesn't know about containers
>> (yet?).
>
> Auditing is something that has a lot of requirements imposed on it by 
> security 
> standards. There was no requirement to have an auid until audit came along 
> and 
> said that uid is not good enough to know who is issuing commands because of 
> su 
> or sudo. There was no requirement for sessionid until we had to track each 
> action back to a login so we could see if the login came from the expected 
> place. 

Stop right there.

You want a global identifier in a realm where only relative identifiers
exist, and make sense.

I am sorry that isn't going to happen. EVER.

Square peg, round hole.  It doesn't work, it doesn't make sense, and
most especially it doesn't allow anyone to reconstruct anything, because
it does not make sense and does not match what the kernel is doing.

Container IDs do not, and will not exist.  There is probably something
reasonable in your request but until you stop talking that nonsense I
can't see it.

Global IDs take us into the namespace of namespaces problem and that
isn't going to happen.  I have already bent as far in this direction as
I can go.  Further namespace creation is not a privileged event which
makes the requestion for a container ID make even less sense.  With
anyone able to create whatever they want it will not be a identifier
that makes any sense to someone reading an audit log.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH V6 05/10] audit: log creation and deletion of namespace instances

2015-05-05 Thread Eric W. Biederman
Steve Grubb  writes:

> The requirements for auditing of containers should be derived from VPP. In 
> it, 
> it asks for selectable auditing, selective audit, and selective audit review. 
> What this means is that we need the container and all its children to have 
> one 
> identifier that is inserted into all the events that are associated with the 
> container.

That is technically impossible.  Nested containers exist.

That is when container G is nested in container F which is in turn
nested in container E which is in turn nested in container D which is in
turn nested in container C which is in turn nested in container B which
is nested in container A there is no one label you can put on audit
messages from container G which is the ``correct'' one.

Or are you proposing that something in container G have labels
A B C D E F G included on every audit message?   That introduces enough
complexity in generating and parsing the messages I wouldn't trust those
messages as the least bug in generation and parsing would be a security
issue.

What is the world is VPP?  It sounds like something non-public thing.
Certainly it has never been a part of the public container discussion
and as such it appears to be completely ridiculous to bring up in a
public discussion.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH V6 00/10] namespaces: log namespaces per task

2015-04-28 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> On 15/04/24, Eric W. Biederman wrote:
>> Richard Guy Briggs  writes:
>> > On 15/04/22, Richard Guy Briggs wrote:
>> >> On 15/04/20, Eric W. Biederman wrote:
>> >> > Richard Guy Briggs  writes:
>> >> > 
>> >
>> > Do I even need to report the device number anymore since I am concluding
>> > s_dev is never set (or always zero) in the nsfs filesystem by
>> > mount_pseudo() and isn't even mountable? 
>> 
>> We still need the dev. We do have a device number get_anon_bdev fills it in.
>
> Fine, it has a device number.  There appears to be only one of these
> allocated per kernel.  I can get it from &nsfs->fs_supers (and take the
> first instance given by hlist_for_each_entry and verify there are no
> others).  Why do I need it, again?

Because if we have to preserve the inode number over a migration event I
want to preserve the fact that we are talking about inode numbers from a
superblock with a device number.

Otherwise known as I am allergic to kernel global identifiers, because
they can be major pains.  I don't want to have to go back and implement
a namespace for namespaces.

>> >> They are all covered:
>> >> sys_unshare > unshare_userns > create_user_ns
>> >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > 
>> >> copy_mnt_ns
>> >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > 
>> >> copy_utsname > clone_uts_ns
>> >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > 
>> >> copy_ipcs > get_ipc_ns
>> >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > 
>> >> copy_pid_ns > create_pid_namespace
>> >> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > 
>> >> copy_net_ns
>> 
>> Then why the special change to fork?  That was not reflected on
>> the unshare path as far as I could see.
>
> Fork can specify more than one CLONE flag at once, so collecting them
> all in one statementn seemed helpful.  setns can only set one at a time.

unshare can also specify more than one CLONE flag at once.

I just pointed that out becase that seemed really unsymmetrical.

> Ok, understood, we can't just punt this one to a higher layer...
>
> So this comes back to a question above, which is how do we determine
> which device it is from?  Sounds like we need something added to
> ns_common or one of the 6 namespace types structs.

Or we can just hard code reading it off of the appropriate magic
filesystem.  Probably what we want is a well named helper function that
does the job.

I just care that when we talk about these things we are talking about
inode numbers from a superblock that is associated with a given device
number.  That way I don't have nightmares about dealing with a namespace
for namespaces.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH V6 00/10] namespaces: log namespaces per task

2015-04-24 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> On 15/04/22, Richard Guy Briggs wrote:
>> On 15/04/20, Eric W. Biederman wrote:
>> > Richard Guy Briggs  writes:
>> > 
>> > > The purpose is to track namespace instances in use by logged processes 
>> > > from the
>> > > perspective of init_*_ns by logging the namespace IDs (device ID and 
>> > > namespace
>> > > inode - offset).
>> > 
>> > In broad strokes the user interface appears correct.
>> > 
>> > Things that I see that concern me:
>> > 
>> > - After Als most recent changes these inodes no longer live in the proc
>> >   superblock so the device number reported in these patches is
>> >   incorrect.
>> 
>> Ok, found the patchset you're talking about:
>>  3d3d35b kill proc_ns completely
>>  e149ed2 take the targets of /proc/*/ns/* symlinks to separate fs
>>  f77c801 bury struct proc_ns in fs/proc
>>  33c4294 copy address of proc_ns_ops into ns_common
>>  6344c43 new helpers: ns_alloc_inum/ns_free_inum
>>  6496452 make proc_ns_operations work with struct ns_common * instead of 
>> void *
>>  3c04118 switch the rest of proc_ns_operations to working with &...->ns
>>  ff24870 netns: switch ->get()/->put()/->install()/->inum() to working 
>> with &net->ns
>>  58be2825 make mntns ->get()/->put()/->install()/->inum() work with 
>> &mnt_ns->ns
>>  435d5f4 common object embedded into various struct ns
>> 
>> Ok, I've got some minor jigging to do to get inum too...
>
> Do I even need to report the device number anymore since I am concluding
> s_dev is never set (or always zero) in the nsfs filesystem by
> mount_pseudo() and isn't even mountable? 

We still need the dev. We do have a device number get_anon_bdev fills it in.

> In fact, I never needed to
> report the device since proc ida/idr and inodes are kernel-global and
> namespace-oblivious.

This is the bit I really want to keep to be forward looking.  If we
every need to preserve the inode numbers across a migration we could
have different super blocks with different inode numbers for the same
namespace.

>> > - I am nervous about audit logs being flooded with users creating lots
>> >   of namespaces.  But that is more your lookout than mine.
>> 
>> There was a thought to create a filter to en/disable this logging...
>> It is an auxiliary record to syscalls, so they can be ignored by userspace 
>> tools.
>> 
>> > - unshare is not logging when it creates new namespaces.
>> 
>> They are all covered:
>> sys_unshare > unshare_userns > create_user_ns
>> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > 
>> copy_mnt_ns
>> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > 
>> copy_utsname > clone_uts_ns
>> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_ipcs 
>> > get_ipc_ns
>> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > 
>> copy_pid_ns > create_pid_namespace
>> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > 
>> copy_net_ns

Then why the special change to fork?  That was not reflected on
the unshare path as far as I could see.

>> > As small numbers are nice and these inodes all live in their own
>> > superblock now we should be able to remove the games with
>> > PROC_DYNAMIC_FIRST and just use small numbers for these inodes
>> > everywhere.
>> 
>> That is compelling if I can untangle the proc inode allocation code from the
>> ida/idr.  Should be as easy as defining a new ns_alloc_inum (and 
>> ns_free_inum)
>> to use instead of proc_alloc_inum with its own ns_inum_ida and ns_inum_lock,
>> then defining a NS_DYNAMIC_FIRST and defining NS_{IPC,UTS,USER,PID}_INIT_INO 
>> in
>> the place of the existing PROC_*_INIT_INO.

Something like that.  Just a new ida/idr allocator specific to that
superblock.

Yeah.  It is somewhere on my todo, but I have been prioritizing getting
the bugs that look potentially expoloitable fixed in the mount
namespace.  Al made things nice for one case but left a mess for a bunch
of others.

>> > I honestly don't know how much we are going to care about namespace ids
>> > during migration.  So far this is not a problem that has come up.
>> 
>> Not for CRIU, but it will be an issue for a container auditor that aggregates
>> information from individually auditted hosts.
>> 
>> > I don't think migratio

Re: [PATCH V6 00/10] namespaces: log namespaces per task

2015-04-21 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> The purpose is to track namespace instances in use by logged processes from 
> the
> perspective of init_*_ns by logging the namespace IDs (device ID and namespace
> inode - offset).

In broad strokes the user interface appears correct.

Things that I see that concern me:

- After Als most recent changes these inodes no longer live in the proc
  superblock so the device number reported in these patches is
  incorrect.

- I am nervous about audit logs being flooded with users creating lots
  of namespaces.  But that is more your lookout than mine.

- unshare is not logging when it creates new namespaces.

As small numbers are nice and these inodes all live in their own
superblock now we should be able to remove the games with
PROC_DYNAMIC_FIRST and just use small numbers for these inodes
everywhere.

I have answered your comments below.

> 1/10 exposes proc's ns entries structure which lists a number of useful
> operations per namespace type for other subsystems to use.
>
> 2/10  proc_ns: define PROC_*_INIT_INO in terms of PROC_DYNAMIC_FIRST
>
> 3/10 provides an example of usage for audit_log_task_info() which is used by
> syscall audits, among others.  audit_log_task() and 
> audit_common_recv_message()
> would be other potential use cases.
>
> Proposed output format:
> This differs slightly from Aristeu's patch because of the label conflict with
> "pid=" due to including it in existing records rather than it being a seperate
> record.  It has now returned to being a seperate record.  The proc device
> major/minor are listed in hexadecimal and namespace IDs are the proc inode
> minus the base offset.
>   type=NS_INFO msg=audit(1408577535.306:82): dev=00:03 netns=3 utsns=-3 
> ipcns=-4 pidns=-1 userns=-2 mntns=0
>
> 4/10 change audit startup from __initcall to subsys_initcall to get it started
> earlier to be able to receive initial namespace log messages.
>
> 5/10 tracks the creation and deletion of namespaces, listing the type of
> namespace instance, proc device ID, related namespace id if there is one and
> the newly minted namespace ID.
>
> Proposed output format for initial namespace creation:
>   type=AUDIT_NS_INIT_UTS msg=audit(1408577534.868:5): pid=1 uid=0 
> auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_utsns=(none) 
> utsns=-3 res=1
>   type=AUDIT_NS_INIT_USER msg=audit(1408577534.868:6): pid=1 uid=0 
> auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_userns=(none) 
> userns=-2 res=1
>   type=AUDIT_NS_INIT_PID msg=audit(1408577534.868:7): pid=1 uid=0 
> auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_pidns=(none) 
> pidns=-1 res=1
>   type=AUDIT_NS_INIT_MNT msg=audit(1408577534.868:8): pid=1 uid=0 
> auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_mntns=(none) mntns=0 
> res=1
>   type=AUDIT_NS_INIT_IPC msg=audit(1408577534.868:9): pid=1 uid=0 
> auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_ipcns=(none) 
> ipcns=-4 res=1
>   type=AUDIT_NS_INIT_NET msg=audit(1408577533.500:10): pid=1 uid=0 
> auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_netns=(none) netns=2 
> res=1
>
> And a CLONE action would result in:
>   type=type=AUDIT_NS_INIT_NET msg=audit(1408577535.306:81): pid=481 uid=0 
> auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 
> old_netns=2 netns=3 res=1
>
> While deleting a namespace would result in:
>   type=type=AUDIT_NS_DEL_MNT msg=audit(1408577552.221:85): pid=481 uid=0 
> auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 
> mntns=4 res=1
>
> 6/10 accepts a PID from userspace and requests logging an AUDIT_NS_INFO record
> type (CAP_AUDIT_CONTROL required).
>
> 7/10 is a macro for CLONE_NEW_* flags.
>
> 8/10 adds auditing on creation of namespace(s) in fork.
>
> 9/10 adds auditing a change of namespace on setns.
>
> 10/10 attaches a AUDIT_NS_INFO record to AUDIT_VIRT_CONTROL records
> (CAP_AUDIT_WRITE required).
>
>
> v5 -> v6:
>   Switch to using namespace ID based on namespace proc inode minus base 
> offset
>   Added proc device ID to qualify proc inode reference
>   Eliminate exposed /proc interface
>
> v4 -> v5:
>   Clean up prototypes for dependencies on CONFIG_NAMESPACES.
>   Add AUDIT_NS_INFO record type to AUDIT_VIRT_CONTROL record.
>   Log AUDIT_NS_INFO with PID.
>   Move /proc//ns_* patches to end of patchset to deprecate them.
>   Log on changing ns (setns).
>   Log on creating new namespaces when forking.
>   Added a macro for CLONE_NEW*.
>
> v3 -> v4:
>   Seperate out the NS_INFO message from the SYSCALL message.
>   Moved audit_log_namespace_info() out of audit_log_task_info().
>   Use a seperate message type per namespace type for each of INIT/DEL.
>   Make ns= easier to search across NS_INFO and NS_INIT/DEL_XXX msg types.
>   Add /proc//ns/ documentation.
>   Fix dynamic initial ns logging.
>
> v2 -> v3:
>   Use atomic64_t in ns_s

Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number

2014-08-29 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> On 14/08/23, Eric W. Biederman wrote:
>> Richard Guy Briggs  writes:
>> 
>> > Generate and assign a serial number per namespace instance since boot.
>> >
>> > Use a serial number per namespace (unique across one boot of one kernel)
>> > instead of the inode number (which is claimed to have had the right to 
>> > change
>> > reserved and is not necessarily unique if there is more than one proc fs) 
>> > to
>> > uniquely identify it per kernel boot.
>> 
>> This approach is just broken.
>> 
>> For this to work with migration (aka criu) you need to implement a
>> namespace of namespaces.  You haven't done this, and therefore
>> such an interface will break existing userspace.
>> 
>> Inside of audit I can understand not caring about these issues,
>> but you go foward and expose these serial numbers in proc,
>> and generally make this infrastructure available to others.
>> 
>> The deep issue with migration is that we move tasks from one machine
>> from another and on the destination machine we need to have all of the
>> same global identifiers for software to function properly.
>> 
>> My weasel words around the proc inode numbers is to preserve to allow us
>> room to be able to restore those ids if it every becomes relevant for
>> migration.
>
> What do you do if the inode number is already in use on the target
> host?

Since the inode numbers are relative to a superblock or a pid namespace
the numbers that are in use can be restored on the target system
by creating them in the appropriate namespace.

The support does not exist in the kernel today for doing that because no
one has cared but as architected the support can be added if needed to
support migration.

>> That is the proc inode numbers (technically) live in a pid namespace,
>> (aka a mount of proc).  So depending on the pid namespace you are in
>> or the mount of proc you look in the numbers could change.
>> 
>> Qualifications like that must exist to have a prayer of ever supporting
>> process migration in the crazy corner cases where people start caring
>> about inode numbers.
>> 
>> We currently don't and inode numbers for a namespace will never change
>> after a namespace is created.  So I think you really are ok using the
>> proc inode numbers.  I am happy declaring by fiat that the inode numbers
>> that audit uses are the numbers connected to the initial pid namespace.
>
> But once a namespace/container is migrated, it is a different audit that
> is looking at it (unless we create an audit manager or entity that
> functions at the level of a container manager), so audit should not care.

These numbers were exported to everyone as a general purpose facility in
proc.  If audit is global and audit doesn't migrate you are right it
doesn't matter.  However if these numbers are used by anyone else for
anything else it causes a problem.

Further given that people run entire distributions in containers we may
reach the point where we wish to run auditd in a container in the
future.  I would hate to paint ourselves into a corner with a design
that could never allow audit to migrate.  Support that case someday
seems a valid naive desire.

>> At a fairly basic level anything that is used to identify namespaces for
>> any general purpose use needs to have most if not all of the same
>> properties of the proc inode numbers.  The most important of which is
>> being tied to some context/namespace so there is a ability if we ever
>> need it to migrate those numbers from one machine to another.
>
> Sooo...  does it make any sense to have those inode or serial numbers be
> blank inside the namespace/container itself, but only visible to its
> manager outside the container (unless it is the initial namespace)?

Mostly I think it makes sense to use the inode numbers from the initial
pid namespace.  They already exist.  They already are unique.  (Which
means I don't need to maintain more code and more special cases).  And
the do what you need now.

I probably haven't followed closely enough but I don't see what makes
inode numbers undesirable.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number

2014-08-23 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> Generate and assign a serial number per namespace instance since boot.
>
> Use a serial number per namespace (unique across one boot of one kernel)
> instead of the inode number (which is claimed to have had the right to change
> reserved and is not necessarily unique if there is more than one proc fs) to
> uniquely identify it per kernel boot.

This approach is just broken.

For this to work with migration (aka criu) you need to implement a
namespace of namespaces.  You haven't done this, and therefore
such an interface will break existing userspace.

Inside of audit I can understand not caring about these issues,
but you go foward and expose these serial numbers in proc,
and generally make this infrastructure available to others.

The deep issue with migration is that we move tasks from one machine
from another and on the destination machine we need to have all of the
same global identifiers for software to function properly.

My weasel words around the proc inode numbers is to preserve to allow us
room to be able to restore those ids if it every becomes relevant for
migration.

That is the proc inode numbers (technically) live in a pid namespace,
(aka a mount of proc).  So depending on the pid namespace you are in
or the mount of proc you look in the numbers could change.

Qualifications like that must exist to have a prayer of ever supporting
process migration in the crazy corner cases where people start caring
about inode numbers.

We currently don't and inode numbers for a namespace will never change
after a namespace is created.  So I think you really are ok using the
proc inode numbers.  I am happy declaring by fiat that the inode numbers
that audit uses are the numbers connected to the initial pid namespace.

At a fairly basic level anything that is used to identify namespaces for
any general purpose use needs to have most if not all of the same
properties of the proc inode numbers.  The most important of which is
being tied to some context/namespace so there is a ability if we ever
need it to migrate those numbers from one machine to another.

Eric

> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index 8e78110..93cb380 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -41,6 +41,23 @@ struct nsproxy init_nsproxy = {
>  #endif
>  };
>  
> +/**
> + * ns_serial - compute a serial number for the namespace
> + *
> + * Compute a serial number for the namespace to uniquely identify it in
> + * audit records.
> + */
> +long long ns_serial(void)
> +{
> + static atomic64_t serial = ATOMIC_INIT(4); /* reserved for IPC, UTS, 
> user, PID */
> + long long ret;
> +
> + ret = atomic64_add_return(1, &serial);
> + BUG_ON(!ret);
> +
> + return ret;
> +}
> +
>  static inline struct nsproxy *create_nsproxy(void)
>  {
>   struct nsproxy *nsproxy;

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH V3 0/6] namespaces: log namespaces per task

2014-08-20 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> On 14/05/20, Richard Guy Briggs wrote:
>> On 14/05/20, Eric Paris wrote:
>> > On Tue, 2014-05-20 at 09:12 -0400, Richard Guy Briggs wrote:
>> > > The purpose is to track namespaces in use by logged processes from the
>> > > perspective of init_*_ns.
>
> (Including the Linux API list due to the additions to /proc//ns/.
> Please see http://www.kernelhub.org/?p=2&msg=477668 and in particular
> http://www.kernelhub.org/?msg=477678&p=2 )

Sigh if you have to use something like this use the proc inode
number.  It is the same thing.

I hate to claim it is unique absent of the proc superblock but it is and
will be for the forseable future.  

It would be better to include the block device number that appears in
proc of 3h of the primary mount of to qualify the number.  But it is not
particularly important.  Coming up with an additional unique number that
needs to be maintained seems stronlgy silly.

Eric



> 
>
>> > > 5/6 provides an example of usage for audit_log_task_info() which is used 
>> > > by
>> > > syscall audits, among others.  audit_log_task() and 
>> > > audit_common_recv_message()
>> > > would be other potential use cases.
>> > > 
>> > > Proposed output format:
>> > > This differs slightly from Aristeu's patch because of the label conflict 
>> > > with
>> > > "pid=" due to including it in existing records rather than it being a 
>> > > seperate
>> > > record.  The serial numbers are printed in hex.
>> > >  type=SYSCALL msg=audit(1399651071.433:72): arch=c03e syscall=272 
>> > > success=yes exit=0 a0=4000 a1= a2=0 a3=22 items=0 
>> > > ppid=1 pid=483 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 
>> > > sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="(t-daemon)" 
>> > > exe="/usr/lib/systemd/systemd" netns=97 utsns=2 ipcns=1 pidns=4 userns=3 
>> > > mntns=5 subj=system_u:system_r:init_t:s0 key=(null)
>> > 
>> > I'm undecided if I'd rather see this as a separate NS_INFO record type.
>> > It would mean we could filter them out of the logs...
>> 
>> I don't have a strong opinion either way.  Steve G.'s opinion would be
>> helpful here.
>
> Breaking this out into a seperate record type would mean calling
> audit_log_namespace_info() from the callers of audit_log_task_info()
> (presently audit_log_link_denied(), ima_audit_measurement(),
> audit_log_exit() ), but would make it easier to emit with other record
> types.
>
>> > Do we print out lots of pidns=0 for tasks not in a newly created NS?  Do
>> > we want to?
>> 
>> There is no "pidns=0", but I understand your point.  This would come
>> back to Steve G.'s point about disappearing fields, and the value of
>> having it as a seperate record that could be filtered.
>> 
>> > > 6/6 tracks the creation and deletion of of namespaces, listing the type 
>> > > of
>> > > namespace instance, related namespace id if there is one and the newly 
>> > > minted
>> > > serial number.
>> > > 
>> > > Proposed output format:
>> > >  type=NS_INIT msg=audit(1400217435.706:94): pid=524 uid=0 
>> > > auid=4294967295 ses=4294967295 subj=system_u:system_r:mount_t:s0 
>> > > type=2 old_snum=0 snum=a1 res=1
>> > 
>> > I'd love to be able to grep for netns=20 and find both the NS_INIT and
>> > the SYSCALL/NS_INFO records, instead of having them named different
>> > things.  So basically I think you want to translate the type= into a
>> > string for the old_X= and X=...
>> 
>> That actually makes a bit more sense, and we could do away with the
>> "type=" field since the "Xns=" fields are self-describing.
>
> Steve, how do you feel about this since the NS_INIT/NS_DEL record type
> would have changing fields amongst the 6 namespace types.  Only one of
> the 6 would be present in each message.  I suppose we could have an
> NS_INIT_XXX/NS_DEL_XXX for each namespace type.
>
>> - RGB
>
> - RGB
>
> --
> Richard Guy Briggs 
> Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, 
> Red Hat
> Remote, Ottawa, Canada
> Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [GIT PULL] namespaces fixes for 3.14-rcX

2014-03-10 Thread Eric W. Biederman
Eric Paris  writes:

> On Sun, 2014-03-09 at 20:06 -0700, Eric W. Biederman wrote:
>> My efforts to get these fixes noticed by people who care about audit
>> seem to have landed on deaf ears, so since these are namespace related I
>> have put them in my tree.
>
> This commentary sounds like a pile of crap seeing as how there was a
> whole discussion around how to handle this stuff, which you were a part
> of.  And that the audit tree already picked up these 2 patches.

My apologies for offending.  Last night when I looked the audit tree had
not been updated in 7 weeks, and Andrew Morton wasn't ready to push
these fixes to Linus and was looking for someone who would.  Keeping
important bug fixes in my retransmit queue leaves me very grumpy.

> In any case, since I haven't sent them to Linus and I'm glad that is
> done, so feel free to consider this me Acking the pull request.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


[GIT PULL] namespaces fixes for 3.14-rcX

2014-03-10 Thread Eric W. Biederman

Linus,

Please pull the for-linus branch from the git tree:

   git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
for-linus

   HEAD: d211f177b28ec070c25b3d0b960aa55f352f731f audit: Update kdoc for 
audit_send_reply and audit_list_rules_send

Starting with 3.14-rc1 the audit code is faulty (think oopses and races)
with respect to how it computes the network namespace of which socket to
reply to, and I happened to notice by chance when reading through the
code.

My efforts to get these fixes noticed by people who care about audit
seem to have landed on deaf ears, so since these are namespace related I
have put them in my tree.

My testing and the automated build bots don't find any problems with
these fixes.

Eric W. Biederman (3):
  audit: Use struct net not pid_t to remember the network namespce to reply 
in
  audit: Send replies in the proper network namespace.
  audit: Update kdoc for audit_send_reply and audit_list_rules_send

 include/linux/audit.h |3 ++-
 kernel/audit.c|   31 ---
 kernel/audit.h|2 +-
 kernel/auditfilter.c  |   10 +++---
 4 files changed, 26 insertions(+), 20 deletions(-)

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [RFC][PATCH] audit: Simplify by assuming the callers socket buffer is large enough

2014-03-05 Thread Eric W. Biederman
Steve Grubb  writes:

> On Tuesday, March 04, 2014 07:21:52 PM David Miller wrote:
>> From: ebied...@xmission.com (Eric W. Biederman)
>> Date: Tue, 04 Mar 2014 14:41:16 -0800
>> 
>> > If we really want the ability to always appened to the queue of skb's
>> > is to just have a version of netlink_send_skb that ignores the queued
>> > limits.  Of course an evil program then could force the generation of
>> > enough audit records to DOS the kernel, but we seem to be in that
>> > situation now.  Shrug.
>> 
>> There is never a valid reason to bypass the socket limits.
>> 
>> It protects the system from things going out of control.
>> 
>> Netlink packet sends can fail, and audit should cope with that
>> event instead of trying to bludgeon it into not happening.
>
> I am not sure what exactly is the problem with the code that was being 
> patched. The audit code has been stable and working pretty well for everyone 
> for the last 5-6 years. But lately there has been a lot of kernel code churn 
> and I am concerned that the changes are being made without a complete 
> understanding of the design goals.

The problem is that the audit code in the kernel is not using netlink
correctly and which leads to mistakes when the code is updated because
the code in the kernel is unnecessarily complex, and inefficient.

> The audit system has to be very reliable. It can't lose any event or record. 
> The people that really depend on it would rather have access denied to the 
> system than lose any event. This is the reason it goes to such
> lengths.

But the designers of the code didn't really go to any lengths at all.
There are enormous and cumbersome hacks to avoid fixing the kernel
interfaces to do what audit wants to use the kernel interfaces for.
That is the audit code for transmitting replies is stupid and is causing
serious maintenance issues, by growing hacks upon hacks instead of
dealing with the core issues.

The first approximation of what you want should have been be to set the
netlink socket recvbuf to as large as the can be:
setsockopt(sk, SOL_SOCKET, SO_RCVBUFFORCE, INTMAX);

If/when 2GiB is shown to be not enough people should should have worked
on the netlink layer to allow even larger piles of audit messages to be
in flight, assuming you really do want to bring down the system when the
audit client is just too slow.

> If I understand the patch, it looks like its affecting code that adds, 
> deletes, 
> or lists audit rules. This code is quite old and working well. It didn't at 
> first. We found a lot of problems by writing scripts like:

Working well does not mean implemented well, and for a security feature
not implemented well means broken.  And frankly since I can't see a
single setsockopt to set the received buffer size for audit netlink
sockets I have to stronlgy suspect that people were working around
userpsace bugs with stupid kernel code.

> #!/bin/bash
> while [ 1 ] ;
> do
> echo "Inserting syscalls..."
> sys=1
> while [ $sys -lt 100 ]
> do
> sys=`expr $sys + 1`
> echo "$sys"
> auditctl -a entry,always -S $sys
> done
>
> echo "Listing..."
> auditctl -l
> echo "Deleting..."
> auditctl -D
> done
>
> with another shell running:
>
> #!/bin/bash
> while [ 1 ] ; 
> do
>   echo "Listing..."
>   auditctl -l
> done
>
> and another shell running:
>
> #!/bin/bash
> while [ 1 ] ; 
> do
>   echo "Disabling..."
>   auditctl -e 0
>   echo "Enabling..."
>   auditctl -e 1
> done
>
> You can probably imagine more stress tests. But the proposed code should be 
> well tested similar to this.

Honestly since no one cares enough to maintain the kernel code properly
I really think we should just rip audit out instead trying to present
userspace with the delusion that the code works, and will continue to
work properly.

My apologies for being grumpy.  I have no intention of breaking
userspace (which is why my last patch was an rfc) but at the same time
I care absolutely nothing about audit.  It solves none of my problems
and works at all of the wrong layers to be interesting for me.

I just happened to notice that the code has been broken since 3.14-rc1
and no one understands or cares enough about audit to even accept
trivial kernel patches to fix the bugs.  No offense but I am not
interested in testing code I care nothing about, especially when there
is no one to pick up the patches.  My RFC patch was supposed to be a
word to the wise.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [RFC][PATCH] audit: Simplify by assuming the callers socket buffer is large enough

2014-03-04 Thread Eric W. Biederman
David Miller  writes:

> From: Andrew Morton 
> Date: Tue, 4 Mar 2014 13:30:04 -0800
>
>> On Fri, 28 Feb 2014 20:50:19 -0800 ebied...@xmission.com (Eric W. Biederman) 
>> wrote:
>> 
>>> 
>>> Modify audit_send_reply to directly use a non-blocking send and
>>> to return an error on failure (if anyone cares).
>>> 
>>> Modify audit_list_rules_send to use audit_send_reply and give up
>>> if we can not send a packet.
>>> 
>>> Merge audit_list_rules into iaudit_list_rules_send as the code
>>> is now sufficiently simple to not justify to callers.
>>> 
>>> Kill audit_send_list, audit_send_reply_thread because using
>>> a separate thread for replies is not needed when sending
>>> packets syncrhonously.
>> 
>> Nothing much seems to be happening here?

Well you picked up the patch that fixes the worst of the bugs that I was
complaining about.  Beyond that I don't know what makes sense.

>> In an earlier email you said "While reading through 3.14-rc1 I found a
>> pretty siginficant mishandling of network namespaces in the recent
>> audit changes." Were those recent audit changes a post-3.14 thing?  And
>> what is the user-visible effect of the mishandling?
>
> I took a quick look at this patch yesterday, and my only suspicion is
> that threads are created currently by this code purely to cons up a
> blockable context for the netlink code.
>
> Perhaps it wants to avoid any netlink packet drops from being possible.
>
> I'm not so sure.

Looking at the history (see below) the stated reason from David
Woodhouse is to prevent the removal of packets from the packet queues
when we have reached our rcvbuf limit.

The reasoning doesn't make a lick of sense to me and I was hoping the
folks dealing with audit would comment.

If we really want the ability to always appened to the queue of skb's
is to just have a version of netlink_send_skb that ignores the queued
limits.  Of course an evil program then could force the generation of
enough audit records to DOS the kernel, but we seem to be in that
situation now.  Shrug.

> Anyways, some investigation into perhaps figuring out the intentions of
> the original implementation would be nice.

I had looked briefly and missed a few things.  Here is the complete
history in all of it's confusion.

In brief.  The code came in using non-blocking netlink writes.

David Woodhouse made the writes blocking.

Al Viro, and then Eric Paris patch the code to deal with deadlocks
cause by the blocking writes.


commit 4527a30f157fca45c102ea49a2fb34a4502264eb
Author: akpm 
Date:   Mon Apr 12 20:29:12 2004 +

[PATCH] Light-weight Auditing Framework

From: Rik Faith 

This patch provides a low-overhead system-call auditing framework for Linux
that is usable by LSM components (e.g., SELinux).  This is an update of the
patch discussed in this thread:

http://marc.theaimsgroup.com/?t=10781588811&r=1&w=2

In brief, it provides for netlink-based logging of audit records that have
been generated in other parts of the kernel (e.g., SELinux) as well as the
ability to audit system calls, either independently (using simple
filtering) or as a compliment to the audit record that another part of the
kernel generated.

The main goals were to provide system call auditing with 1) as low overhead
as possible, and 2) without duplicating functionality that is already
provided by SELinux (and/or other security infrastructures).  This
framework will work "stand-alone", but is not designed to provide, e.g.,
CAPP functionality without another security component in place.

This updated patch includes changes from feedback I have received,
including the ability to compile without CONFIG_NET (and better use of
tabs, so use -w if you diff against the older patch).

Please see http://people.redhat.com/faith/audit/ for an early example
user-space client (auditd-0.4.tar.gz) and instructions on how to try it.

My future intentions at the kernel level include improving filtering (e.g.,
syscall personality/exit codes) and syscall support for more architectures.
 First, though, I'm going to work on documentation, a (real) audit daemon,
and patches for other user-space tools so that people can play with the
framework and understand how it can be used with and without SELinux.


Update:

Light-weight Auditing Framework receive filter fixes
From: Rik Faith 

Since audit_receive_filter() is only called with audit_netlink_sem held, it
cannot race with either audit_del_rule() or audit_add_rule(), so the
list_for_each_entry_rcu()s may be replaced by list_for_each_entry()s, and
the rcu_read_{u

Re: [PATCH] audit: Use struct net not pid_t to remember the network namespce to reply in

2014-03-03 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> On 14/02/28, Eric W. Biederman wrote:
>> While reading through 3.14-rc1 I found a pretty siginficant mishandling
>> of network namespaces in the recent audit changes.
>> 
>> In struct audit_netlink_list and audit_reply add a reference to the
>> network namespace of the caller and remove the userspace pid of the
>> caller.  This cleanly remembers the callers network namespace, and
>> removes a huge class of races and nasty failure modes that can occur
>> when attempting to relook up the callers network namespace from a pid_t
>> (including the caller's network namespace changing, pid wraparound, and
>> the pid simply not being present).
>
> Ok, so I see that avoiding pid_t in struct audit_reply and struct
> audit_netlink_list is necessary.  Why not switch to struct pid?
>
> How does this patch solve a caller's network namespace changing?

This solves the callers network namespace changing or the caller going
away entirely (a much more serious concern) because we capture the
network namespace at the time of the request when the caller is in the
kernel.  I would have simply captured the socket we want to reply on but
there did not appear to be a good way to do that.

Reading through it again capturing current->nsproxy->net_ns is striclty
wrong.  We should be capturing sock_net(NETLINK_CB(skb).sk).  The
network namespace of the requesting socket.  That handles even weird
cases of passing file descriptors between processes in different network
namespaces.  (An incremental patch to change to code to selct the
network namespace of the requesting socket to follow in a moment).

Still what my patch implements today at least means we won't oops the
kernel if the audit process exits early, and causes get_net_ns_by_pid
to return NULL.


This whole code path is so crazy because what we really should be doing
is sending the packets in nonblocking mode and just dropping packets
if the receiving socket does not have enough socket buvffers.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


[RFC][PATCH] audit: Simplify by assuming the callers socket buffer is large enough

2014-03-03 Thread Eric W. Biederman

Modify audit_send_reply to directly use a non-blocking send and
to return an error on failure (if anyone cares).

Modify audit_list_rules_send to use audit_send_reply and give up
if we can not send a packet.

Merge audit_list_rules into iaudit_list_rules_send as the code
is now sufficiently simple to not justify to callers.

Kill audit_send_list, audit_send_reply_thread because using
a separate thread for replies is not needed when sending
packets syncrhonously.

Signed-off-by: "Eric W. Biederman" 
---

I haven't properly tested and made certain this doesn't break userspace,
but this is much simpler than what audit is currently doing.

I really don't understand why we are using kernel threads to allow us to
exceed the receiving sockets configured buffer limits.  That just seems
insane.  If that is really what we want we should be able to force
the receiving buffer limits up in audit_send_reply.

 include/linux/audit.h |2 +-
 kernel/audit.c|   75 -
 kernel/audit.h|   14 ++---
 kernel/auditfilter.c  |   64 ++
 4 files changed, 31 insertions(+), 124 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index ec1464df4c60..cd2f5112822a 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -464,7 +464,7 @@ extern int audit_filter_user(int type);
 extern int audit_filter_type(int type);
 extern int audit_rule_change(int type, __u32 portid, int seq,
void *data, size_t datasz);
-extern int audit_list_rules_send(struct sk_buff *request_skb, int seq);
+extern void audit_list_rules_send(struct sk_buff *request_skb, int seq);
 
 extern u32 audit_enabled;
 #else /* CONFIG_AUDIT */
diff --git a/kernel/audit.c b/kernel/audit.c
index 32086bff5564..201808fc86aa 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -180,12 +180,6 @@ struct audit_buffer {
gfp_tgfp_mask;
 };
 
-struct audit_reply {
-   __u32 portid;
-   struct net *net;
-   struct sk_buff *skb;
-};
-
 static void audit_set_portid(struct audit_buffer *ab, __u32 portid)
 {
if (ab) {
@@ -496,26 +490,6 @@ static int kauditd_thread(void *dummy)
return 0;
 }
 
-int audit_send_list(void *_dest)
-{
-   struct audit_netlink_list *dest = _dest;
-   struct sk_buff *skb;
-   struct net *net = dest->net;
-   struct audit_net *aunet = net_generic(net, audit_net_id);
-
-   /* wait for parent to finish and send an ACK */
-   mutex_lock(&audit_cmd_mutex);
-   mutex_unlock(&audit_cmd_mutex);
-
-   while ((skb = __skb_dequeue(&dest->q)) != NULL)
-   netlink_unicast(aunet->nlsk, skb, dest->portid, 0);
-
-   put_net(net);
-   kfree(dest);
-
-   return 0;
-}
-
 struct sk_buff *audit_make_reply(__u32 portid, int seq, int type, int done,
 int multi, const void *payload, int size)
 {
@@ -541,25 +515,9 @@ out_kfree_skb:
return NULL;
 }
 
-static int audit_send_reply_thread(void *arg)
-{
-   struct audit_reply *reply = (struct audit_reply *)arg;
-   struct net *net = reply->net;
-   struct audit_net *aunet = net_generic(net, audit_net_id);
-
-   mutex_lock(&audit_cmd_mutex);
-   mutex_unlock(&audit_cmd_mutex);
-
-   /* Ignore failure. It'll only happen if the sender goes away,
-  because our timeout is set to infinite. */
-   netlink_unicast(aunet->nlsk , reply->skb, reply->portid, 0);
-   put_net(net);
-   kfree(reply);
-   return 0;
-}
 /**
  * audit_send_reply - send an audit reply message via netlink
- * @portid: netlink port to which to send reply
+ * @request_skb: The request skb (used to calculate where to reply)
  * @seq: sequence number
  * @type: audit message type
  * @done: done (last) flag
@@ -570,33 +528,24 @@ static int audit_send_reply_thread(void *arg)
  * Allocates an skb, builds the netlink message, and sends it to the port id.
  * No failure notifications.
  */
-static void audit_send_reply(struct sk_buff *request_skb, int seq, int type, 
int done,
-int multi, const void *payload, int size)
+int audit_send_reply(struct sk_buff *request_skb, int seq, int type, int done,
+int multi, const void *payload, int size)
 {
u32 portid = NETLINK_CB(request_skb).portid;
struct net *net = sock_net(NETLINK_CB(request_skb).sk);
+   struct audit_net *aunet = net_generic(net, audit_net_id);
struct sk_buff *skb;
-   struct task_struct *tsk;
-   struct audit_reply *reply = kmalloc(sizeof(struct audit_reply),
-   GFP_KERNEL);
-
-   if (!reply)
-   return;
 
skb = audit_make_reply(portid, seq, type, done, multi, payload, size);
if (!skb)
-   goto out;
-
-   reply->

[PATCH] audit: Send replies in the proper network namespace.

2014-03-03 Thread Eric W. Biederman

In perverse cases of file descriptor passing the current network
namespace of a process and the network namespace of a socket used by
that socket may differ.  Therefore use the network namespace of the
appropiate socket to ensure replies always go to the appropiate
socket.

Signed-off-by: "Eric W. Biederman" 
---

This is an incremental change on top of my previous patch to guarantee
that replies always happen in the appropriate network namespace.

 include/linux/audit.h |3 ++-
 kernel/audit.c|   21 ++---
 kernel/auditfilter.c  |7 +--
 3 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index aa865a9a4c4f..ec1464df4c60 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -43,6 +43,7 @@ struct mq_attr;
 struct mqstat;
 struct audit_watch;
 struct audit_tree;
+struct sk_buff;
 
 struct audit_krule {
int vers_ops;
@@ -463,7 +464,7 @@ extern int audit_filter_user(int type);
 extern int audit_filter_type(int type);
 extern int audit_rule_change(int type, __u32 portid, int seq,
void *data, size_t datasz);
-extern int audit_list_rules_send(__u32 portid, int seq);
+extern int audit_list_rules_send(struct sk_buff *request_skb, int seq);
 
 extern u32 audit_enabled;
 #else /* CONFIG_AUDIT */
diff --git a/kernel/audit.c b/kernel/audit.c
index 1e5756f16f6f..32086bff5564 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -570,9 +570,11 @@ static int audit_send_reply_thread(void *arg)
  * Allocates an skb, builds the netlink message, and sends it to the port id.
  * No failure notifications.
  */
-static void audit_send_reply(__u32 portid, int seq, int type, int done,
+static void audit_send_reply(struct sk_buff *request_skb, int seq, int type, 
int done,
 int multi, const void *payload, int size)
 {
+   u32 portid = NETLINK_CB(request_skb).portid;
+   struct net *net = sock_net(NETLINK_CB(request_skb).sk);
struct sk_buff *skb;
struct task_struct *tsk;
struct audit_reply *reply = kmalloc(sizeof(struct audit_reply),
@@ -585,7 +587,7 @@ static void audit_send_reply(__u32 portid, int seq, int 
type, int done,
if (!skb)
goto out;
 
-   reply->net = get_net(current->nsproxy->net_ns);
+   reply->net = get_net(net);
reply->portid = portid;
reply->skb = skb;
 
@@ -675,8 +677,7 @@ static int audit_get_feature(struct sk_buff *skb)
 
seq = nlmsg_hdr(skb)->nlmsg_seq;
 
-   audit_send_reply(NETLINK_CB(skb).portid, seq, AUDIT_GET, 0, 0,
-&af, sizeof(af));
+   audit_send_reply(skb, seq, AUDIT_GET, 0, 0, &af, sizeof(af));
 
return 0;
 }
@@ -796,8 +797,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct 
nlmsghdr *nlh)
s.backlog   = skb_queue_len(&audit_skb_queue);
s.version   = AUDIT_VERSION_LATEST;
s.backlog_wait_time = audit_backlog_wait_time;
-   audit_send_reply(NETLINK_CB(skb).portid, seq, AUDIT_GET, 0, 0,
-&s, sizeof(s));
+   audit_send_reply(skb, seq, AUDIT_GET, 0, 0, &s, sizeof(s));
break;
}
case AUDIT_SET: {
@@ -907,7 +907,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct 
nlmsghdr *nlh)
   seq, data, nlmsg_len(nlh));
break;
case AUDIT_LIST_RULES:
-   err = audit_list_rules_send(NETLINK_CB(skb).portid, seq);
+   err = audit_list_rules_send(skb, seq);
break;
case AUDIT_TRIM:
audit_trim_trees();
@@ -972,8 +972,8 @@ static int audit_receive_msg(struct sk_buff *skb, struct 
nlmsghdr *nlh)
memcpy(sig_data->ctx, ctx, len);
security_release_secctx(ctx, len);
}
-   audit_send_reply(NETLINK_CB(skb).portid, seq, AUDIT_SIGNAL_INFO,
-   0, 0, sig_data, sizeof(*sig_data) + len);
+   audit_send_reply(skb, seq, AUDIT_SIGNAL_INFO, 0, 0,
+sig_data, sizeof(*sig_data) + len);
kfree(sig_data);
break;
case AUDIT_TTY_GET: {
@@ -985,8 +985,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct 
nlmsghdr *nlh)
s.log_passwd = tsk->signal->audit_tty_log_passwd;
spin_unlock(&tsk->sighand->siglock);
 
-   audit_send_reply(NETLINK_CB(skb).portid, seq,
-AUDIT_TTY_GET, 0, 0, &s, sizeof(s));
+   audit_send_reply(skb, seq, AUDIT_TTY_GET, 0, 0, &s, sizeof(s));
break;
}
case AUDIT_TTY_SET: {
diff --git a/kernel/auditfilter.c b/kernel/auditfil

[PATCH] audit: Use struct net not pid_t to remember the network namespce to reply in

2014-02-28 Thread Eric W. Biederman

While reading through 3.14-rc1 I found a pretty siginficant mishandling
of network namespaces in the recent audit changes.

In struct audit_netlink_list and audit_reply add a reference to the
network namespace of the caller and remove the userspace pid of the
caller.  This cleanly remembers the callers network namespace, and
removes a huge class of races and nasty failure modes that can occur
when attempting to relook up the callers network namespace from a pid_t
(including the caller's network namespace changing, pid wraparound, and
the pid simply not being present).

Signed-off-by: "Eric W. Biederman" 
---
 kernel/audit.c   |   10 ++
 kernel/audit.h   |2 +-
 kernel/auditfilter.c |3 ++-
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 34c5a2310fbf..1e5756f16f6f 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -182,7 +182,7 @@ struct audit_buffer {
 
 struct audit_reply {
__u32 portid;
-   pid_t pid;
+   struct net *net;
struct sk_buff *skb;
 };
 
@@ -500,7 +500,7 @@ int audit_send_list(void *_dest)
 {
struct audit_netlink_list *dest = _dest;
struct sk_buff *skb;
-   struct net *net = get_net_ns_by_pid(dest->pid);
+   struct net *net = dest->net;
struct audit_net *aunet = net_generic(net, audit_net_id);
 
/* wait for parent to finish and send an ACK */
@@ -510,6 +510,7 @@ int audit_send_list(void *_dest)
while ((skb = __skb_dequeue(&dest->q)) != NULL)
netlink_unicast(aunet->nlsk, skb, dest->portid, 0);
 
+   put_net(net);
kfree(dest);
 
return 0;
@@ -543,7 +544,7 @@ out_kfree_skb:
 static int audit_send_reply_thread(void *arg)
 {
struct audit_reply *reply = (struct audit_reply *)arg;
-   struct net *net = get_net_ns_by_pid(reply->pid);
+   struct net *net = reply->net;
struct audit_net *aunet = net_generic(net, audit_net_id);
 
mutex_lock(&audit_cmd_mutex);
@@ -552,6 +553,7 @@ static int audit_send_reply_thread(void *arg)
/* Ignore failure. It'll only happen if the sender goes away,
   because our timeout is set to infinite. */
netlink_unicast(aunet->nlsk , reply->skb, reply->portid, 0);
+   put_net(net);
kfree(reply);
return 0;
 }
@@ -583,8 +585,8 @@ static void audit_send_reply(__u32 portid, int seq, int 
type, int done,
if (!skb)
goto out;
 
+   reply->net = get_net(current->nsproxy->net_ns);
reply->portid = portid;
-   reply->pid = task_pid_vnr(current);
reply->skb = skb;
 
tsk = kthread_run(audit_send_reply_thread, reply, "audit_send_reply");
diff --git a/kernel/audit.h b/kernel/audit.h
index 57cc64d67718..8df132214606 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -247,7 +247,7 @@ extern void audit_panic(const char *message);
 
 struct audit_netlink_list {
__u32 portid;
-   pid_t pid;
+   struct net *net;
struct sk_buff_head q;
 };
 
diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index 14a78cca384e..a5e3d73d73e4 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "audit.h"
 
 /*
@@ -1083,8 +1084,8 @@ int audit_list_rules_send(__u32 portid, int seq)
dest = kmalloc(sizeof(struct audit_netlink_list), GFP_KERNEL);
if (!dest)
return -ENOMEM;
+   dest->net = get_net(current->nsproxy->net_ns);
dest->portid = portid;
-   dest->pid = task_pid_vnr(current);
skb_queue_head_init(&dest->q);
 
mutex_lock(&audit_filter_mutex);
-- 
1.7.5.4

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH 11/12] pid: rewrite task helper functions avoiding task->pid and task->tgid

2013-08-28 Thread Eric W. Biederman
Peter Zijlstra  writes:

> On Mon, Aug 26, 2013 at 10:37:22PM -0400, Richard Guy Briggs wrote:
>> On Fri, Aug 23, 2013 at 08:36:21AM +0200, Peter Zijlstra wrote:
>> > Except that's not the case, with namespaces there's a clear hierarchy
>> > and the task_struct::pid is the one true value aka. root namespace.
>> 
>> Peter, I agonized over the access efficiency of dropping this one or the
>> duplicate in task_struct::pids and this one was far easier to drop in
>> terms of somehow always forcing
>> task->pids[PIDTYPE_PID].pid->numbers[0].nr to point to task->pid.
>
> You mean there's more than 1 site that sets task_struct::pid? I thought
> we only assign that thing once in fork.c someplace.

No there is not and that is not a concern.

Now I had thought given how the perf subsystem was constructed that only
the god like root was even allowed to use the code.  But it turns out
there is sysctl_perf_event_paranoid that can bet twiddled that in some
circumstance that unprivileged users are allowed to use perf.  Which
ultimately means perf will return the wrong data.

Meaning that perf is broken by design and perf has no excuse to be using
task->pid.  Similarly every other place in the kernel that has made the
same mistake.  I mention perf explicitly for two reasons.  perf gets the
namespace handling horribly wrong, perf is the only place in the kernel
where we are accessing pids frequently enough for an extra cache line
miss to be a concern.

When really pids in the kernel what we care about is not some stupid
number but the stuct pid which points to that tasks, process groups, and
sessions.  You know the object that a pid is the name for.

So yes as a long term direction task->pid and task->tgid need to be
killed because we keep getting subsystems like perf that return the
wrong data to userspace, or perform the wrong checks, because the
current structure makes it seem like it is ok to do the wrong thing.

Now that should not be Richard's fight.  Hopefully he can focus on
fixing audit.

> There's a few special cases, like the idle threads having pid-0 and
> 'simple' people like myself prefer to use task_struct::pid for debugging
> when we run our simple kernels without all this namespace stuff
> enabled.

Which is why a special printf format is likely a good idea because it
means you can easily print pids without needing to call ungainly helper
functions.

Of course you can't run kernels without this ``namespace'' stuff
enabled.  The best you can do is run kernels without the ability to
create new instances of the namespaces.

> The entire task->pids[PIDTYPE_PID].pid->numbers[0].nr thing just seems
> increddibly unwieldy and double dereferences, even if the lines are
> 'hot' are worse than single derefs.

But it is so much better than having to look up task->pid in a hash
table to get anything done, which is the previous state of affairs.

When the pid namespace support was merged except for a few overlooked
corner cases everything was converted except a bunch of printk
statements.  Now I look in the kernel and we have had subsystems added
that totally get the namespace handling wrong because it is easy and
apparently socially acceptable to mess up other peoples hard work.

Apparently even Linus yelling at people a few years back wasn't enough
for people to wake up and be responsible developers and use proper
abstractions.  So the only valid long term direction I can see is to
remove all of the abstractions that make getting pid handling wrong,
and to fix all of the code that gets it wrong today.  So that there are
no more bad examples in the kernel.

This isn't Richard's fight, and this isn't what needs to happen with
audit.  Audit just needs to be fixed so that so that it reports pid
numbers the audit daemon can make sense of, and to do that the audit
should use helper functions that are pid namespace aware and make it
clear that the proper pid namespace is being used.

In the long term ->pid and ->tgid must die, and take all of this wrong
think with it.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [Part1 PATCH 00/22] Add namespace support for audit

2013-06-27 Thread Eric W. Biederman
Aristeu Rozanski  writes:

> On Thu, Jun 20, 2013 at 03:01:09PM -0700, Eric W. Biederman wrote:
>> Gao feng  writes:
>> 
>> > On 06/20/2013 11:02 AM, Gao feng wrote:
>> >> If we don't tie audit to user namespace, there is still one problem.
>> >
>> > One more problem. some audit messages are generated by some net subsystem
>> > such as netfilter. If we don't tie audit to user namespace, we have no
>> > idea where these audit messages should go. there is no relationship between
>> > net namespace and audit namespace while we can get user namespace through
>> > net user namespace.
>> 
>> I am in favor of the user namespace tie in.
>> 
>> I am in favor of running a per user namespace audit filter once per user
>> namespace walking up the user namespace hierarchy.  Each filter would
>> deliver messages to a different userspace audit daemon.
>> 
>> Until we agreement to go that far I am not certain the kernel generated
>> audit messages should go anywhere except to the global audit daemon.
>> 
>> I think on an individual basis we can look at kernel audit messages and
>> see if they should go to just the global user namespace.  Just the user
>> namspace of the relevant network stack.  Or if the message should go to
>> the audit daemon of every user namespace that is an ancestor of some
>> starting user namespace.
>> 
>> But please let's error on the side of caution here.
>
> Let's say I'd like to use userns but still have a single and global
> auditd.

By definition the kernel auditd functionality is broken if we don't
allow a global auditd.  So that will be allowed.

> Dropping the capability will be required for that?

No.  Dropping capabilities and being in a state where you can never
interact with the primary audit context is required to safely have
access to a subordinate audit context.  Never being able to intereact
with the primary audit context removes all possibility of confusing
a privileged application and break the security model.

Entering a user namespace by definition drops all capabilities in a
non-recoverable way.  Which makes being in a user namespace a sufficient
condition to use a subordinate audit context.

> That sounds
> bad. The fact new user namespaces will want to control the separated
> audit namespace doesn't require tie in.

No.  The fact that you can gain root privileges by executing a suid root
application when not in a user namespace requires a tie in.


In summary.  A subordinate audit context is not safe if you can still
execute a suid root application and become root.  The interesting use
case is inside of a user namespace, which never allows gaining
capabilities outside of the user namespace.  Which makes a tie in with
user namespaces sensible, as it never allows subverting the current
mechanisms and it is where people want the functionality.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [Part1 PATCH 00/22] Add namespace support for audit

2013-06-21 Thread Eric W. Biederman
Daniel J Walsh  writes:

> Will I be able to use the audit namespace without the user namespace.  I would
> prefer to be able to use the audit namespace long before I am willing to take
> a chance with the User Namespace for things like light weight virtualization
> and securing processes with MAC.

I will be very surprised if we settle on a design that allows it.

I still think even the existence of multiple audit contexts is a little
iffy but the desire seems strong enough Gao feng will probably work
through all of the issues.

Without restricting processes to a user namespace at the same time as
restricting them to an audit context it becomes very easy to violate the
important audit policies and to bury user space generated messages from
privileged userspace applications.  At least in a user namespace we know
you are not privileged with respect to the rest of the system, so we
would only be dealing with userspace messages you would not be able to
generate otherwise.

As for taking a chance.  You will probably safer with a simultaneous use
of user namespaces and having processes secured with a MAC.  To the best
of my knowledge previous solutions have only been really safe when you
trusted the processes inside not to be malicious.  A user namespace at
least means you can stop using uid 0 inside of your light weight
virtualization.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [Part1 PATCH 00/22] Add namespace support for audit

2013-06-21 Thread Eric W. Biederman
Gao feng  writes:

> On 06/20/2013 11:02 AM, Gao feng wrote:
>> If we don't tie audit to user namespace, there is still one problem.
>
> One more problem. some audit messages are generated by some net subsystem
> such as netfilter. If we don't tie audit to user namespace, we have no
> idea where these audit messages should go. there is no relationship between
> net namespace and audit namespace while we can get user namespace through
> net user namespace.

I am in favor of the user namespace tie in.

I am in favor of running a per user namespace audit filter once per user
namespace walking up the user namespace hierarchy.  Each filter would
deliver messages to a different userspace audit daemon.

Until we agreement to go that far I am not certain the kernel generated
audit messages should go anywhere except to the global audit daemon.

I think on an individual basis we can look at kernel audit messages and
see if they should go to just the global user namespace.  Just the user
namspace of the relevant network stack.  Or if the message should go to
the audit daemon of every user namespace that is an ancestor of some
starting user namespace.

But please let's error on the side of caution here.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [Part1 PATCH 00/22] Add namespace support for audit

2013-06-20 Thread Eric W. Biederman
Eric Paris  writes:

> On Wed, 2013-06-19 at 16:49 -0400, Aristeu Rozanski wrote:
>> On Wed, Jun 19, 2013 at 09:53:32AM +0800, Gao feng wrote:
>> > This patchset is first part of namespace support for audit.
>> > in this patchset, the mainly resources of audit system have
>> > been isolated. the audit filter, rules havn't been isolated
>> > now. It will be implemented in Part2. We finished the isolation
>> > of user audit message in this patchset.
>> > 
>> > I choose to assign audit to the user namespace.
>> > Right now,there are six kinds of namespaces, such as
>> > net, mount, ipc, pid, uts and user. the first five
>> > namespaces have special usage. the audit isn't suitable to
>> > belong to these five namespaces, And since the flag of system
>> > call clone is in short supply, we can't provide a new flag such
>> > as CLONE_NEWAUDIT to enable audit namespace separately. so the
>> > user namespace may be the best choice.
>> 
>> I thought it was said on the last submission that to tie userns and
>> audit namespace would be a bad idea?
>
> I consider it a non-starter.  unpriv users are allowed to launch their
> own user namespace.  The whole point of audit is to have only a priv
> user be allowed to make changes.  If you tied audit namespace to user
> namespace you grant an unpriv user the ability to modify audit.
>
> NAK.
>
> If there are not clone flags you will either need to only do this from
> unshare and not from clone, or get more flags to clone

I completely agree that only priveleged user should be able to make
changes.

On the flip side, I don't know if this is at all interesting unless we
have a solution that works for users in unprivileged user namespaces.
Something like having the possibility of two or more instances of audit
working on every action.  One for each layer of privilege.

Gao feng, how do you want to use the audit infrastructure?

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH] [BZ905179] audit: omit check for uid and gid validity in audit rules and data

2013-04-13 Thread Eric W. Biederman
Richard Guy Briggs  writes:

> On Tue, Apr 09, 2013 at 02:39:32AM -0700, Eric W. Biederman wrote:

>> @@ -377,6 +383,12 @@ static struct audit_entry *audit_rule_to_entry(struct 
>> audit_rule *rule)
>>  if (!gid_valid(f->gid))
>>  goto exit_free;
>>  break;
>> +case AUDIT_LOGINUID_SET:
>> +if ((f->op != Audit_not_equal) && (f->op != 
>> Audit_equal))
>> +goto exit_free;
>> +if ((f->val != 0) && (f->val != 1))
>
> Why the extra comparison to "1"?
>
> Are you anticipating already a userspace process making a call using the
> newof type AUDIT_LOGINUID_SET with a value of 1?

Sorry I missed this question the first time.  I am anticipating
AUDIT_LOGINUID_SET to return a value of 0 or 1 (a boolean) and so I
allow the operations and constants that are valid for a boolean.

In particuluar I allow the opeartions == !=  and the boolean constants 0 and 1.

>> @@ -1380,6 +1405,10 @@ static int audit_filter_user_rules(struct audit_krule 
>> *rule,
>>  result = 
>> audit_uid_comparator(audit_get_loginuid(current),
>>f->op, f->uid);
>>  break;
>> +case AUDIT_LOGINUID_SET:
>> +result = audit_comparator(audit_loginuid_set(current),
>> +  f->op, f->val);
>> +break;
>>  case AUDIT_SUBJ_USER:
>>  case AUDIT_SUBJ_ROLE:
>>  case AUDIT_SUBJ_TYPE:
>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> index 3a11d34..27d0a50 100644
>> --- a/kernel/auditsc.c
>> +++ b/kernel/auditsc.c
>> @@ -750,6 +750,9 @@ static int audit_filter_rules(struct task_struct *tsk,
>>  if (ctx)
>>  result = audit_uid_comparator(tsk->loginuid, 
>> f->op, f->uid);
>>  break;
>> +case AUDIT_LOGINUID_SET:
>> +result = audit_comparator(audit_loginuid_set(tsk), 
>> f->op, f->val);
>> +break;
>
> (OT: I assume the "if (ctx)" is wrong in the AUDIT_LOGINUID case
> above.)

Good question.  I didn't see that when I was preparing my patch.

ctx is not necessary but I think ctx is set when a task is being audited
so it may serve a useful function.  But I have to admit it that if(ctx)
looks like a bug.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH] [BZ905179] audit: omit check for uid and gid validity in audit rules and data

2013-04-10 Thread Eric W. Biederman
Steve Grubb  writes:

> On Tuesday, April 09, 2013 02:39:32 AM Eric W. Biederman wrote:
>> Andrew Morton  writes:
>> > On Wed, 20 Mar 2013 15:18:17 -0400 Richard Guy Briggs  
> wrote:
>> >> audit rule additions containing "-F auid!=4294967295" were failing with
>> >> EINVAL.
>> >> 
>> >> UID_INVALID (and GID_INVALID) is actually a valid uid (gid) for setting
>> >> and
>> >> testing against audit rules.  Remove the check for invalid uid and gid
>> >> when
>> >> parsing rules and data for logging.
>> 
>> In general testing against invalid uid appears completely bogus, and
>> should always return true.  As it is and essentially always has been
>> incorrect to explicitly set any kernel uid to that value.
>
> This is the unset value that daemons would have. 

As their uid, or gid, or euid, or fsuid.  Not in the least.

> When a real person logs in, 
> pam_loginuid writes the loginuid that was authenticated to. So, any time the 
> value is -1, we are dealing with a daemon or system process. When it comes to 
> auditing, people usually make an exception so that daemon and normal system 
> activity is not recorded. So, you would make a rule something like

> -a always,exit -S open -F exit=-EPERM -F auid>=500 -F auid!=-1

My point is that -1 is a special case that applies only to loginuid, and
that when testing for -1 is not testing for a specific loginuid value
but instead it is testing to see if loginuid has been set.  Semantically
the last is very different.

>> The only case where this appears to make the least little bit of sense
>> is if the goal of the test is to test to see if an audit logloginuid
>> has been set at all.  In which case depending on a test against
>> 4294967295 is bogus because you are depending on an intimate internal
>> kernel implementation detail.
>
> Its been this way and documented since at least 9 years ago. The audit system 
> has been broken for all intents and purposes since the 3.7 kernel was 
> released.

I certainly haven't seen the documentation.  And no one has much cared
about the audit subsystem this "breakage" of the audit
subsystem. Despite things failing with a clear error code.  So there are
two choices.  We mark the audit subsystem as broken moving it to staging
and then delete it because no one cares enough to look after it and
maintain it.  Or we have a constructive conversation about what to do
with it.

I have proposed a patch that will preserve the existing behavior while
adding maintainable semantics.  Will someone who cares please test my
proposed fix?

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH] [BZ905179] audit: omit check for uid and gid validity in audit rules and data

2013-04-09 Thread Eric W. Biederman
Andrew Morton  writes:

> On Wed, 20 Mar 2013 15:18:17 -0400 Richard Guy Briggs  wrote:
>
>> audit rule additions containing "-F auid!=4294967295" were failing with 
>> EINVAL.
>> 
>> UID_INVALID (and GID_INVALID) is actually a valid uid (gid) for setting and
>> testing against audit rules.  Remove the check for invalid uid and gid when
>> parsing rules and data for logging.

In general testing against invalid uid appears completely bogus, and
should always return true.  As it is and essentially always has been
incorrect to explicitly set any kernel uid to that value.

The only case where this appears to make the least little bit of sense
is if the goal of the test is to test to see if an audit logloginuid
has been set at all.  In which case depending on a test against
4294967295 is bogus because you are depending on an intimate internal
kernel implementation detail.

Certainly removing the gid_valid tests is completely gratitious in this
case.

>> Revert part of ca57ec0f00c3f139c41bf6b0a5b9bcc95bbb2ad7 (2012-09-11) to fix
>> this.
>
> Eric, can you please take a look?
>
>> Signed-off-by: Richard Guy Briggs 
>> ---
>>  kernel/auditfilter.c |   12 
>>  1 files changed, 0 insertions(+), 12 deletions(-)
>> 
>> diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
>> index f9fc54b..457ee39 100644
>> --- a/kernel/auditfilter.c
>> +++ b/kernel/auditfilter.c
>> @@ -360,10 +360,7 @@ static struct audit_entry *audit_rule_to_entry(struct 
>> audit_rule *rule)
>>  /* bit ops not implemented for uid comparisons */
>>  if (f->op == Audit_bitmask || f->op == Audit_bittest)
>>  goto exit_free;
>> -
>>  f->uid = make_kuid(current_user_ns(), f->val);
>> -if (!uid_valid(f->uid))
>> -goto exit_free;
>
> It concerns me that map_id_down() can return -1 on error and that this
> change causes the kernel to no longer notice that error?

Me too.  Where we only communicate with audit in the initial user
namespace right now it isn't absolutely broken but it certainly isn't a
habit I want to get into.

How about something like my untested patch below that add an explicit
operation to test if loginuid has been set?

Eric

From: "Eric W. Biederman" 
Date: Tue, 9 Apr 2013 02:22:10 -0700
Subject: [PATCH] audit: Make testing for a valid loginuid explicit.

audit rule additions containing "-F auid!=4294967295" were failing
with EINVAL.

Apparently some userland audit rule sets want to know if loginuid uid
has been set and are using a test for auid != 4294967295 to determine
that.

In practice that is a horrible way to ask if a value has been set,
because it relies on subtle implementation details and will break
every time the uid implementation in the kernel changes.

So add a clean way to test if the audit loginuid has been set, and
silently convert the old idiom to the cleaner and more comprehensible
new idiom.

Reported-By: Richard Guy Briggs  wrote:
Signed-off-by: "Eric W. Biederman" 
---
 include/linux/audit.h  |5 +
 include/uapi/linux/audit.h |1 +
 kernel/auditfilter.c   |   29 +
 kernel/auditsc.c   |5 -
 4 files changed, 39 insertions(+), 1 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index a9fefe2..8a1ddde 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -390,6 +390,11 @@ static inline void audit_ptrace(struct task_struct *t)
 #define audit_signals 0
 #endif /* CONFIG_AUDITSYSCALL */
 
+static inline bool audit_loginuid_set(struct task_struct *tsk)
+{
+   return uid_valid(audit_get_loginuid(tsk));
+}
+
 #ifdef CONFIG_AUDIT
 /* These are defined in audit.c */
/* Public API */
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 9f096f1..9554a19 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -246,6 +246,7 @@
 #define AUDIT_OBJ_TYPE 21
 #define AUDIT_OBJ_LEV_LOW  22
 #define AUDIT_OBJ_LEV_HIGH 23
+#define AUDIT_LOGINUID_SET 24
 
/* These are ONLY useful when checking
 * at syscall exit time (AUDIT_AT_EXIT). */
diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index 540f986..6381d17 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -349,6 +349,12 @@ static struct audit_entry *audit_rule_to_entry(struct 
audit_rule *rule)
if (f->op == Audit_bad)
goto exit_free;
 
+   /* Support legacy tests for a valid loginuid */
+   if ((f->type == AUDIT_LOGINUID) && (f->val == 42949672

Re: [PATCH RFC] audit: provide namespace information in user originated records

2013-03-21 Thread Eric W. Biederman
Serge Hallyn  writes:

> Quoting Eric Paris (epa...@redhat.com):
>> So the kernel socket(s) would be per network namespace, but we divide
>> messages per user namespace?  Which socket do I send them on,
>> considering the possible crazy many<->many mappings between user and
>> network namespaces.  It all makes me cry a little.
>
> not many-many - each netns is owned by exactly one userns.  The userns
> from which the netns was created.

Doh.  I missed this question and I think I misunderstood when Eric
Paris was talking about multicasting audit messages.

If what we are really talking about is sending some audit messages to
an auditd in a container what appears obvious to me is that we define
a per user namespace capability something like CAP_AUDIT_CONTROL.  That
does most or all of what CAP_AUDIT_CONTROL does in the init user
namespace.  Especially capturing audit_pid and audit_nlk_portid to
decide who to send the message to.

Something like:

struct audit_control {
int initialized;
pid_t   pid;
u32 nlk_portid;
};

struct user_namespace {
   ...
   struct audit_contol audit;
};

Then the transmission would be something like:
struct user_namespace *user_ns = ...;
for (;;) {
if (ns->audit_pid) {
err = netlink_unicast(ns->audit.sock, skb, 
ns->audit.nlk_portid, 0);
}
if (!ns->parent)
break;
ns = ns->parent;
}

If someone finds auditd interesting enough to do that work.

In general I think it only makes sense if we can reuse the existing
userspace auditd.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH RFC] audit: provide namespace information in user originated records

2013-03-20 Thread Eric W. Biederman
Eric Paris  writes:

> On Wed, 2013-03-20 at 13:49 -0500, Serge Hallyn wrote:
>> Quoting Eric Paris (epa...@redhat.com):
>> > On Wed, 2013-03-20 at 13:36 -0500, Serge Hallyn wrote:
>> > > Quoting Aristeu Rozanski (aroza...@redhat.com):
>> > > > This is a bit fuzzy to me, perhaps due I'm not fully understanding
>> > > > userns implementation yet, so bear with me:
>> > > > I thought of changing so userns would not grant CAP_AUDIT_WRITE and
>> > > > CAP_AUDIT_CONTROL unless the process already has it (i.e. it'd require
>> > > 
>> > > Seems like CAP_AUDIT_WRITE should be targeted against the
>> > > skb->netns->userns.  Then CAP_AUDIT_WRITE can be treated like any other
>> > > capability.  Last I knew (long time ago) you had to be in init_user_ns
>> > > to talk audit, but that's ok - this would just do the right thing in
>> > > any case.
>> > 
>> > kauditd should be considered as existing in the init user namespace.  So
>> > I'd think we'd want to check if the process had CAP_AUDIT_WRITE in the
>> > init user namespace and if so, allow it to send messages.  Who care what
>> > *ns the process exists in.  If it has it in the init namespace, go
>> > ahead.  Thus the process that created the container would need
>> > CAP_AUDIT_WRITE in the init namespace for this to all work, right?
>> 
>> Yes.  What I was suggesting is intended to work if that situation ever
>> changes.  But I have zero complaints about doing it as you say, as I
>> doubt it ever will/ought to change.
>> 
>> That basically means CAP_AUDIT_WRITE would be worthless in a non-init
>> userns.  That's fine - at least the rules would be consistent.
>
> [veering away from this particular patch]
>
> We are also talking about adding a CAP_AUDIT_READ and sending messages
> via multicast on the audit socket.  The problem is I don't know how the
> audit socket could work in the network namespace world.

Hmm.  I don't quite know how CAP_AUDIT_READ could work.  When delivering
a message to a socket you really don't know who is on the other end.

> Right now kauditd has:
>
> audit_sock = netlink_kernel_create(&init_net, NETLINK_AUDIT, &cfg);
>
> So there won't ever be anything on the kernel side of the audit socket
> in a non-init network namespace.  Lets say that is fixed somehow (I
> assume it's possible?  something? magic pixies?)

One socket for each network namespace...  It is a pain but doable.

> I think we'd somehow
> need to do the CAP_AUDIT_READ check against the user namespace
> associated with the network namespace in question?  But what messages
> should go to this userspace auditd?

Messages generated by processes in that user namespace?  

> Going to have to have audit namespaces to.  But only CAP_AUDIT_READ
> would make sense in the new audit namespace...

Given the connection of audit and security I think if we add support for
a non-global auditd the user namespace seems to fit.  The user namespace
is certainly where all of the security connected bits go.

Architecturally it gets a little tricky as it seems to make sense to
generate audit messages that make sense to the process receiving them,
which would mean actually generating a different audit message for
different receiving contexts.

I find the auditsc code odd.  We log file descriptor numbers when a file
is mmaped?  What is something so process relative good to anyone?

On a slightly different tangent.  Do we want to update the AUDIT_CAPSET
message to report the user namespace whose caps we are changing or
perhaps to surpress the message outside of the initial user namespace.

Eric

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH RFC] audit: provide namespace information in user originated records

2013-03-20 Thread Eric W. Biederman
Aristeu Rozanski  writes:

> On Mon, Mar 18, 2013 at 03:16:52PM -0700, Eric W. Biederman wrote:
>> Adding the containers list so folks with container expertise can see
>> what is being proposed.
>> 
>> Aristeu Rozanski  writes:
>> 
>> > This patchset introduces a new audit record to follow all USER records 
>> > which
>> > provides namespace information of the process. The idea is to allow 
>> > processes
>> > in containers to create records in the host system while providing means 
>> > to be
>> > filtered out.
>> 
>> It looks like this mechanism makes it easy for an unprivileged program
>> to spam and overwhelm the audit log.
>> 
>> > For each new namespace, a unique procfs inode number is allocated and this
>> > number has been used by userspace to determine which processes belong to 
>> > the
>> > same namespace. These numbers are used in the new audit record.
>> >
>> > Applications such as libvirt-sandbox and lxc can then report the same 
>> > numbers
>> > when a container is created and destroyed allowing to map records to a 
>> > certain
>> > container. Maybe the next step would be having a record for whenever a new
>> > namespace is created?
>> >
>> > First 6 patches are needed in order to get each namespace's inode number.
>> 
>> Grumble the existing methods can be used you don't have to introduce a
>> whole new set of methods.  Grumble.  Besides the bug of assuming that
>> the inodes now and forever will be the same across all instances of
>> proc.
>
> the existing methods are for procfs use and I didn't want to abuse it.
> like I said the other email, the fact that it's not a reliable way to
> indefinitely describe a namespace due to multiple procfs instances or
> migration, the whole idea is flawed.

It is always possible to pick the instance of /proc connected to the
initial pid namespace.  And there is a device number you can use to say
that.

Usually designs that need global identifiers for namespaces suffer from
the need for a namespace of namespaces (which we sort of have in /proc),
and I push back by default to get people to think if what they are
trying to do really makes sense.

>> > Patch 7 properly defines the new record that is related to the USER
>> > record
>> 
>> Not agmenting the current user records seems a little odd to me.
>> 
>> You also continue in this my current policy of not allowing any audit
>> records in the container itself, so I a don't quite know what the point
>> of all of this is.
>
> your current policy wasn't known to me and
>   /* Only support the initial namespaces for now. */
> sounds like something that didn't happen for other reasons

The reasons were simply that to my knowledge no one has thought through
how audit records and namespaces make sense to interact.

My expectation would be that an extention of audit records would be
logged on a per container basis.  But I don't have any motivating
examples.

>> > Patch 8 allows USER records to be generated from different namespaces
>> 
>> Which essentially allows any user to create any USER record they want
>> whenever they want.
>> 
>> > Here's an example of output:
>> > type=CRED_DISP msg=audit(1363528861.403:311): pid=20016 uid=0 auid=0 
>> > ses=45 subj=system_u:system_r:crond_t:s0-s0:c0.c1023 msg='op=PAM:setcred 
>> > acct="root" exe="/usr/sbin/crond" hostname=? addr=? terminal=cron 
>> > res=success'
>> 
>> Ok.  This seems totally bizarre.  You are running a container with a
>> user namespace with some uid mapped to uid 0?
>
> on the notes section:
>   - while the last patch allows a new userns to send audit records, I 
> haven't
> look yet on making sure it has proper capabilities so regular users'
> containers can create records
>
> so I haven't tried it with userns. It's a RFC. 

I though you would have taken the time to run it at least once, or to
perhaps have manually edited your example to see how things would fit
together.

> That's a regular record
> to show the related records, using initial namespaces. like I stated in
> the email, I wasn't sure how I'd handle capabilities but the idea would be
> to allow containers to log to the system's auditd. since inode numbers
> aren't more reliable for more than a moment, I guess there's no other
> way than having an audit namespace and run an audit daemon inside the
> container (and communicate over the network like an indi

Re: [PATCH RFC 8/8] audit: allow user records to be created inside a container

2013-03-19 Thread Eric W. Biederman
Aristeu Rozanski  writes:

> Since user events will be followed by namespace information, userspace
> can filter off undesired container records.

I don't think we want to allow any user to write to the audit records,
that is what nsown_capable will allow, as all you would need to do is to
unshare the user namespace to be able to write audit records.

Eric

> @@ -597,13 +612,13 @@ static int audit_netlink_ok(struct sk_buff *skb, u16 
> msg_type)
>   case AUDIT_TTY_SET:
>   case AUDIT_TRIM:
>   case AUDIT_MAKE_EQUIV:
> - if (!capable(CAP_AUDIT_CONTROL))
> + if (!nsown_capable(CAP_AUDIT_CONTROL))
>   err = -EPERM;
>   break;
>   case AUDIT_USER:
>   case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG:
>   case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2:
> - if (!capable(CAP_AUDIT_WRITE))
> + if (!nsown_capable(CAP_AUDIT_WRITE))
>   err = -EPERM;
>   break;
>   default:  /* bad msg */

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH RFC] audit: provide namespace information in user originated records

2013-03-19 Thread Eric W. Biederman

Adding the containers list so folks with container expertise can see
what is being proposed.

Aristeu Rozanski  writes:

> This patchset introduces a new audit record to follow all USER records which
> provides namespace information of the process. The idea is to allow processes
> in containers to create records in the host system while providing means to be
> filtered out.

It looks like this mechanism makes it easy for an unprivileged program
to spam and overwhelm the audit log.

> For each new namespace, a unique procfs inode number is allocated and this
> number has been used by userspace to determine which processes belong to the
> same namespace. These numbers are used in the new audit record.
>
> Applications such as libvirt-sandbox and lxc can then report the same numbers
> when a container is created and destroyed allowing to map records to a certain
> container. Maybe the next step would be having a record for whenever a new
> namespace is created?
>
> First 6 patches are needed in order to get each namespace's inode number.

Grumble the existing methods can be used you don't have to introduce a
whole new set of methods.  Grumble.  Besides the bug of assuming that
the inodes now and forever will be the same across all instances of
proc.

> Patch 7 properly defines the new record that is related to the USER
> record

Not agmenting the current user records seems a little odd to me.

You also continue in this my current policy of not allowing any audit
records in the container itself, so I a don't quite know what the point
of all of this is.

> Patch 8 allows USER records to be generated from different namespaces

Which essentially allows any user to create any USER record they want
whenever they want.

> Here's an example of output:
> type=CRED_DISP msg=audit(1363528861.403:311): pid=20016 uid=0 auid=0 ses=45 
> subj=system_u:system_r:crond_t:s0-s0:c0.c1023 msg='op=PAM:setcred acct="root" 
> exe="/usr/sbin/crond" hostname=? addr=? terminal=cron res=success'

Ok.  This seems totally bizarre.  You are running a container with a
user namespace with some uid mapped to uid 0?

That defeats about half the point of having user namespaces, as half the
files in the world are owned by uid 0, and can be written by uid 0
outside of your user namespace.

Hmm.  I need to look at this in a little more detail but I believe our
use of task_pid_vnr here in the audit record is a long standing bug.

> type=UNKNOWN[1327] msg=audit(1363528861.403:311): mnt=4026531840 
> net=4026531956 uts=4026531838 ipc=4026531839 pid=4026531836 user=4026531837
>
> Notes:
> - this is a RFC, all sorts of feedback are much appreciated
> - while the last patch allows a new userns to send audit records, I haven't
>   look yet on making sure it has proper capabilities so regular users'
>   containers can create records

I don't think it does.

> - the record number allocated is just a draft. If this patchset evolves into
>   something that can be merged, please advise which number number is the best
>   choice
> - I'm not subscribed to the list, so please make sure I'm on the Cc list
>
>  fs/namespace.c |   14 +++
>  include/linux/ipc_namespace.h  |1
>  include/linux/mnt_namespace.h  |2 +
>  include/linux/pid_namespace.h  |1
>  include/linux/user_namespace.h |1
>  include/linux/utsname.h|1
>  include/net/net_namespace.h|1
>  include/uapi/linux/audit.h |1
>  ipc/namespace.c|   14 +++
>  kernel/audit.c |   76 
> +
>  kernel/pid_namespace.c |   11 +
>  kernel/user_namespace.c|5 ++
>  kernel/utsname.c   |   14 +++
>  net/core/net_namespace.c   |   14 +++
>  14 files changed, 150 insertions(+), 6 deletions(-)

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit


Re: [PATCH RFC 7/8] audit: report namespace information along with USER events

2013-03-19 Thread Eric W. Biederman
Aristeu Rozanski  writes:

> For userspace generated events, include a record with the namespace
> procfs inode numbers the process belongs to. This allows to track down
> and filter audit messages by userspace.

I am not comfortable with using the inode numbers this way.  It does not
pass the test of can I migrate a container and still have this work
test.  Any kind of kernel assigned name for namespaces fails that test.

I also don't like that you don't include the procfs device number.  An
inode number means nothing without knowing which filesystem you are
referring to.

It may never happen but I reserve the right to have the inode numbers
for namespaces to show up differently in different instances of procfs.

Beyond that I think this usage is possibly buggy by using two audit
records for one event.

> Signed-off-by: Aristeu Rozanski 
> ---
>  include/uapi/linux/audit.h |1 +
>  kernel/audit.c |   51 
> +++-
>  2 files changed, 51 insertions(+), 1 deletions(-)
>
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index 9f096f1..3ec3ccb 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -106,6 +106,7 @@
>  #define AUDIT_NETFILTER_PKT  1324/* Packets traversing netfilter chains 
> */
>  #define AUDIT_NETFILTER_CFG  1325/* Netfilter chain modifications */
>  #define AUDIT_SECCOMP1326/* Secure Computing event */
> +#define AUDIT_USER_NAMESPACE 1327/* Information about process' 
> namespaces */
>  
>  #define AUDIT_AVC1400/* SE Linux avc denial or grant */
>  #define AUDIT_SELINUX_ERR1401/* Internal SE Linux Errors */
> diff --git a/kernel/audit.c b/kernel/audit.c
> index 58db117..b17f9c0 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -62,6 +62,11 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
>  
>  #include "audit.h"
>  
> @@ -641,6 +646,49 @@ static int audit_log_common_recv_msg(struct audit_buffer 
> **ab, u16 msg_type,
>   return rc;
>  }
>  
> +#ifdef CONFIG_NAMESPACES
> +static int audit_log_namespaces(struct task_struct *tsk,
> + struct sk_buff *skb)
> +{
> + struct audit_context *ctx = tsk->audit_context;
> + struct audit_buffer *ab;
> +
> + if (!audit_enabled)
> + return 0;
> +
> + ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_USER_NAMESPACE);
> + if (unlikely(!ab))
> + return -ENOMEM;
> +
> + audit_log_format(ab, "mnt=%u", mntns_get_inum(tsk));
> +#ifdef CONFIG_NET_NS
> + audit_log_format(ab, " net=%u", netns_get_inum(tsk));
> +#endif
> +#ifdef CONFIG_UTS_NS
> + audit_log_format(ab, " uts=%u", utsns_get_inum(tsk));
> +#endif
> +#ifdef CONFIG_IPC_NS
> + audit_log_format(ab, " ipc=%u", ipcns_get_inum(tsk));
> +#endif
> +#ifdef CONFIG_PID_NS
> + audit_log_format(ab, " pid=%u", pidns_get_inum(tsk));
> +#endif
> +#ifdef CONFIG_USER_NS
> + audit_log_format(ab, " user=%u", userns_get_inum(tsk));
> +#endif  
> + audit_set_pid(ab, NETLINK_CB(skb).portid);
> + audit_log_end(ab);
> +
> + return 0;
> +}
> +#else
> +static inline int audit_log_namespaces(struct task_struct *tsk,
> +struct sk_buff *skb)
> +{
> + return 0;
> +}
> +#endif
> +
>  static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
>  {
>   u32 seq, sid;
> @@ -741,7 +789,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct 
> nlmsghdr *nlh)
>   }
>   audit_log_common_recv_msg(&ab, msg_type,
> loginuid, sessionid, sid,
> -   NULL);
> +   current->audit_context);
>  
>   if (msg_type != AUDIT_USER_TTY)
>   audit_log_format(ab, " msg='%.1024s'",
> @@ -758,6 +806,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct 
> nlmsghdr *nlh)
>   }
>   audit_set_pid(ab, NETLINK_CB(skb).portid);
>   audit_log_end(ab);
> + audit_log_namespaces(current, skb);
>   }
>   break;
>   case AUDIT_ADD:

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit