Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data
On Tue, Nov 15, 2022 at 5:16 AM Michael Sammler wrote: > > We're currently working on a feature in chromium that uses pkeys for > > in-process isolation. Being able to use the pkey state in the seccomp > > filter would be pretty useful for this. For example, it would allow > > us to enforce that no code outside the isolated thread would ever > > map/mprotect executable memory. > > We can probably do something similar by adding instruction pointer > > checks to the seccomp filter, but that feels quite hacky and this > > feature would make a much nicer implementation. > > > > Are there any plans to make a version 2 of this patch? > > Thanks for your interest in this patch, but I am now working on other > projects and currently don't plan to make a version 2 of this patch. I'd be happy to take over writing a version 2 for this. Kees and Dave, does this feature overall look good to you? >From the discussion, I think there are two proposed changes: * use an architecture-generic interface as Ram Pai suggested (i.e. add a read_pkey function) * ensure to restore the pkru value or fetch it from the xsave buffer smime.p7s Description: S/MIME Cryptographic Signature
Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data
> We're currently working on a feature in chromium that uses pkeys for > in-process isolation. Being able to use the pkey state in the seccomp > filter would be pretty useful for this. For example, it would allow > us to enforce that no code outside the isolated thread would ever > map/mprotect executable memory. > We can probably do something similar by adding instruction pointer > checks to the seccomp filter, but that feels quite hacky and this > feature would make a much nicer implementation. > > Are there any plans to make a version 2 of this patch? Thanks for your interest in this patch, but I am now working on other projects and currently don't plan to make a version 2 of this patch.
Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data
> Add the current value of an architecture specific protection keys > register (currently PKRU on x86) to data available for seccomp-bpf > programs to work on. This allows filters based on the currently > enabled protection keys. We're currently working on a feature in chromium that uses pkeys for in-process isolation. Being able to use the pkey state in the seccomp filter would be pretty useful for this. For example, it would allow us to enforce that no code outside the isolated thread would ever map/mprotect executable memory. We can probably do something similar by adding instruction pointer checks to the seccomp filter, but that feels quite hacky and this feature would make a much nicer implementation. Are there any plans to make a version 2 of this patch? smime.p7s Description: S/MIME Cryptographic Signature
Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data
On 10/29/2018 11:33 PM, Dave Hansen wrote: But, that's really an implementation detail. The effect on the ABI and how this might constrain future pkeys use is my bigger worry. I'd also want to make sure that your specific use-case is compatible with all the oddities of pkeys, like the 'clone' and signal behavior. Some of that is spelled out here: http://man7.org/linux/man-pages/man7/pkeys.7.html I think my specific use case is compatible with these oddities since the untrusted code is not allowed to install signal handlers or spawn threads himself but only through the trusted code, which ensures, that the PKRU has the correct value when control passes back to the untrusted code. But I agree that these oddities are something a programmer needs to take into account if he uses pkeys in general (and this patch adds new considerations to it if the program installs a seccomp filter and e.g. wants to do system calls from a signal handler). One thing that's a worry is that we have never said that you *can't* write to arbitrary permissions in PKRU. I can totally see some really paranoid code saying, "I'm about to do something risky, so I'll turn off access to *all* pkeys", or " turn off all access except my current stack". If they did that, they might also inadvertently disable access to certain seccomp-restricted syscalls. We can fix that up by documenting restrictions like "code should never change the access rights of any pkey other than those that it allocated", but that doesn't help any old code (of which I hope there is relatively little). I can also see paranoid code wanting to do something like you describe and I think, that we should try to not forbid this behaviour. Documenting restrictions on code which writes to the PKRU as you describe would be one way to go, but this would disallow this paranoid use case if I understand it correctly because the code would not be allowed to disable access to all except of one pkey. My idea would be to not put the blame on the code which writes to the PKRU but on the code which installs the seccomp filter based on the PKRU: It has to make sure, that it allows the system calls which the paranoid code should be allowed to do when the pkey of the paranoid code is the (only) enabled pkey. This could be ensured e.g. by the paranoid code providing the pkey it intends to use to the code which installs the seccomp filter. This of course means, that you need to know what you are doing, when you install a seccomp filter which depends on the PKRU, but I think this is already the case when you install a seccomp filter without this patch (but maybe someone with more seccomp experience than me can say better how much reasoning is needed to install a seccomp filter without and with this patch). -- Michael
Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data
On 10/29/18 2:55 PM, Michael Sammler wrote: >> PKRU getting reset on signals, and the requirement now that it *can't* >> be changed if you make syscalls probably needs to get thought about very >> carefully before we do this, though. > I am not sure, whether I follow you. Are you saying, that PKRU is > currently not guaranteed to be preserved across system calls? > This would make it very hard to use protection keys if libc does not > save and restore the PKRU before/after systemcalls (and I am not aware > of this). It's preserved *across* system calls, but you have to be a bit careful using it _inside_ the kernel. We could context switch off to something else, and not think that we need to restore PKRU until _just_ before we return to userspace. > Or do you mean, that the kernel might want to use the PKRU register for > its own purposes while it is executing? That, or we might keep another process's PKRU state in the register if we don't think anyone is using it. Now that I think about it, I think Rik (cc'd, who was working on those patches) *had* to explicitly restore PKRU because it's hard to tell where we might do a copy_to/from_user() and need it. > Then the solution you proposed in another email in this thread would > work: instead of providing the seccomp filter with the current value of > the PKRU (which might be different from what the user space expects) use > the user space value which must have been saved somewhere (otherwise it > would not be possible to restore it). Yep, that's the worst-case scenario: either fetch PKRU out of the XSAVE buffer (current->fpu->something), or just restore them using an existing API before doing RDPKRU. But, that's really an implementation detail. The effect on the ABI and how this might constrain future pkeys use is my bigger worry. I'd also want to make sure that your specific use-case is compatible with all the oddities of pkeys, like the 'clone' and signal behavior. Some of that is spelled out here: http://man7.org/linux/man-pages/man7/pkeys.7.html One thing that's a worry is that we have never said that you *can't* write to arbitrary permissions in PKRU. I can totally see some really paranoid code saying, "I'm about to do something risky, so I'll turn off access to *all* pkeys", or " turn off all access except my current stack". If they did that, they might also inadvertently disable access to certain seccomp-restricted syscalls. We can fix that up by documenting restrictions like "code should never change the access rights of any pkey other than those that it allocated", but that doesn't help any old code (of which I hope there is relatively little).
Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data
Am 29.10.2018 um 18:29 schrieb Dave Hansen: On 10/29/18 9:48 AM, Jann Horn wrote: On Mon, Oct 29, 2018 at 5:37 PM Dave Hansen wrote: I'm not sure this is a great use for PKRU. I *think* the basic problem is that you want to communicate some rights information down into a filter, and you want to communicate it with PKRU. While it's handy to have an extra register that nobody (generally) mucks with, I'm not quite convinced that we want to repurpose it this way. That's not how I understand it; I believe that the context is probably https://arxiv.org/pdf/1801.06822.pdf ? My understanding is that PKRU is used for lightweight in-process sandboxing, and to extend this sandbox protection to the syscall interface, it is necessary to expose PKRU state to seccomp filters. In other words, this isn't using PKRU exclusively for passing rights into a filter, but it has to use PKRU anyway. PKRU gives information about rights to various bits of application data. From that, a seccomp filter can infer the context, and thus the ability for the code to call a given syscall at a certain point in time. This makes PKRU an opt-in part of the syscall ABI, which is pretty interesting. We _could_ do the same kind of thing with any callee-saved general purpose register, but PKRU is particularly attractive because there is only one instruction that writes to it (well, outside of XSAVE*), and random library code is very unlikely at this point to be using it. I agree with you on the points, why PKRU is particularly attractive but I think the most important point is that PKRU is _not_ a general purpose register, but is already used to control access to some resource (memory). This patch would allow to also control access to another resource (system calls) using the PKRU. This is why it makes sense to use the PKRU in this patch instead of another callee-saved register. PKRU getting reset on signals, and the requirement now that it *can't* be changed if you make syscalls probably needs to get thought about very carefully before we do this, though. I am not sure, whether I follow you. Are you saying, that PKRU is currently not guaranteed to be preserved across system calls? This would make it very hard to use protection keys if libc does not save and restore the PKRU before/after systemcalls (and I am not aware of this). Or do you mean, that the kernel might want to use the PKRU register for its own purposes while it is executing? Then the solution you proposed in another email in this thread would work: instead of providing the seccomp filter with the current value of the PKRU (which might be different from what the user space expects) use the user space value which must have been saved somewhere (otherwise it would not be possible to restore it). Or are you afraid, that one part of a user space program installs a seccomp filter, which blocks system calls based on the PKRU, and another part of the same program (maybe a library) changes the PKRU in a way, which the first part did not expect and the program dies because it tries to do a forbidden system call? I don't know whether the kernel can (and wants) do anything against this. This problem also exists without this patch if you replace system call with memory access. -- Michael
Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data
On 10/29/18 9:48 AM, Jann Horn wrote: > On Mon, Oct 29, 2018 at 5:37 PM Dave Hansen wrote: >> I'm not sure this is a great use for PKRU. I *think* the basic problem >> is that you want to communicate some rights information down into a >> filter, and you want to communicate it with PKRU. While it's handy to >> have an extra register that nobody (generally) mucks with, I'm not quite >> convinced that we want to repurpose it this way. > > That's not how I understand it; I believe that the context is probably > https://arxiv.org/pdf/1801.06822.pdf ? > My understanding is that PKRU is used for lightweight in-process > sandboxing, and to extend this sandbox protection to the syscall > interface, it is necessary to expose PKRU state to seccomp filters. > In other words, this isn't using PKRU exclusively for passing rights > into a filter, but it has to use PKRU anyway. PKRU gives information about rights to various bits of application data. From that, a seccomp filter can infer the context, and thus the ability for the code to call a given syscall at a certain point in time. This makes PKRU an opt-in part of the syscall ABI, which is pretty interesting. We _could_ do the same kind of thing with any callee-saved general purpose register, but PKRU is particularly attractive because there is only one instruction that writes to it (well, outside of XSAVE*), and random library code is very unlikely at this point to be using it. PKRU getting reset on signals, and the requirement now that it *can't* be changed if you make syscalls probably needs to get thought about very carefully before we do this, though.
Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data
On 10/29/18 10:02 AM, Michael Sammler wrote: >>> Also, I'm not sure the kernel provides the PKRU guarantees you want at >>> the moment. Our implementation *probably* works, but it's mostly by >>> accident. > I don't know, which guarantees about the PKRU are provided at the > moment, but the only guarantee needed for this patch is, that the kernel > does not change the bits of the PKRU register, which belong to pkeys > allocated by the user program, between the syscall entry and the call to > secure_computing(). Is there are use case where the kernel would like to > modify these bits of the PKRU? We've been talking about doing more lax save/restore of the XSAVE content (PKRU is part of this content). We would, for instance, only restore it when returning to userspace, but PKRU might not be up-to-date with the value in current->fpu. It's not a deal-breaker with your approach, it's just something to be careful of and make sure PKRU is up-to-date before you go use it.
Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data
On 10/29/2018 05:48 PM, Ram Pai wrote: On Mon, Oct 29, 2018 at 09:25:15AM -0700, Kees Cook wrote: On Mon, Oct 29, 2018 at 4:23 AM, Michael Sammler wrote: Add the current value of an architecture specific protection keys register (currently PKRU on x86) to data available for seccomp-bpf programs to work on. This allows filters based on the currently enabled protection keys. Support for protection keys on the POWER architecture is not part of this patch since I do not have access to a PowerPC, but adding support for it can be achieved by setting sd->pkeys to the AMR register in populate_seccomp_data. Maybe you can use a generic read_pkey() function, and each arch can provide their own implementation. In the case of powerpc it is mfspr(SPRN_AMR); something like the code below might help? diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h index 92a9962..d79efe3 100644 --- a/arch/powerpc/include/asm/pkeys.h +++ b/arch/powerpc/include/asm/pkeys.h @@ -89,6 +89,11 @@ static inline u16 pte_to_pkey_bits(u64 pteflags) #define __mm_pkey_is_reserved(pkey) (reserved_allocation_mask & \ pkey_alloc_mask(pkey)) +static inline u64 read_pkey() +{ +return mfspr(SPRN_AMR); +} + static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey) { if (pkey < 0 || pkey >= arch_max_pkey()) diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h index 19b137f..feea114 100644 --- a/arch/x86/include/asm/pkeys.h +++ b/arch/x86/include/asm/pkeys.h @@ -9,6 +9,11 @@ extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, unsigned long init_val); +static inline u64 read_pkey() +{ +return read_pkru(); +} + static inline bool arch_pkeys_enabled(void) { return boot_cpu_has(X86_FEATURE_OSPKE); diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h index 2955ba97..f1538c7 100644 --- a/include/linux/pkeys.h +++ b/include/linux/pkeys.h @@ -13,6 +13,11 @@ #define PKEY_DEDICATED_EXECUTE_ONLY 0 #define ARCH_VM_PKEY_FLAGS 0 +static inline u64 read_pkey() +{ + return 0; +} + static inline int vma_pkey(struct vm_area_struct *vma) { return 0; RP Thank you very much, this helps indeed. I will add this to the next version of this patch. One use case for this patch is disabling unnecessary system calls for a library (e.g. network i/o for a crypto library) while the library runs without disabling the system calls for the whole program (by changing the protection keys before and after the library executes). Using this one could ensure that the library behaves a expected (e.g. the crypto library not sending private keys to a malicious server). This patch also enables lightweight sandboxing of untrusted code using memory protection keys: Protection keys provide memory isolation but for a sandbox system call isolation is needed as well. This patch allows writing a seccomp filter to prevent system calls by the untrusted code while still allowing system calls for the trusted code. Isn't PKU instruction based? Couldn't a malicious library just change the state of the MPK registers? This seems like an easy way to bypass any filters that used PKU. I'm not convinced this is a meaningful barrier that should be enforced by seccomp. This can also be done with a signal handler with SECCOMP_RET_TRAP and check instruction pointer vs PKU state. -Kees An alternative design would be to extend (c)BPF with a new instruction to read the state of a protection key. This alternate design would provide a simpler interface to the user space since the BPF program would not need to deal with the architecture specific pkeys field in seccomp_data, but the question is, how much of an advantage this would be as the nr field in seccomp_data is already architecture specific. Adding a new instruction for BPF programs is more complicated than this patch and might be a breaking change. Results of selftests/seccomp_benchmark.c on a x86 machine with pkeys support: With patch: Benchmarking 33554432 samples... 28.019505558 - 18.676858522 = 9342647036 getpid native: 278 ns 42.279109885 - 28.019657031 = 14259452854 getpid RET_ALLOW: 424 ns Estimated seccomp overhead per syscall: 146 ns Without patch: Benchmarking 33554432 samples... 28.059619466 - 18.706769155 = 9352850311 getpid native: 278 ns 42.299228279 - 28.059761804 = 14239466475 getpid RET_ALLOW: 424 ns Estimated seccomp overhead per syscall: 146 ns Cc: Kees Cook Cc: Andy Lutomirski Cc: Will Drewry Cc: Ram Pai Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-...@vger.kernel.org Signed-off-by: Michael Sammler --- Changes to the previous version: - added motivation, notes about POWER, alternative design and benchmark results to the commit log - renamed pkru field in seccomp_data to pkeys - changed size of pkru field to __u64 and removed reserved field - added test for x86 arch/mips/kernel/ptrace.c
Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data
On 10/29/2018 05:48 PM, Jann Horn wrote: On Mon, Oct 29, 2018 at 5:37 PM Dave Hansen wrote: On 10/29/18 9:25 AM, Kees Cook wrote: On Mon, Oct 29, 2018 at 4:23 AM, Michael Sammler wrote: Add the current value of an architecture specific protection keys register (currently PKRU on x86) to data available for seccomp-bpf programs to work on. This allows filters based on the currently enabled protection keys. How does the current "assignment" of protection keys to the various uses get communicated to the filter? I assume that you first allocate your protection keys, then install the filter? Yes, but I agree that it should probably be documented, that the filter should only look at the parts of the PKRU, which belong to pkeys the user space program allocated (if the kernel wants to use some parts of the PKRU for its own purposes). I'm not sure this is a great use for PKRU. I *think* the basic problem is that you want to communicate some rights information down into a filter, and you want to communicate it with PKRU. While it's handy to have an extra register that nobody (generally) mucks with, I'm not quite convinced that we want to repurpose it this way. That's not how I understand it; I believe that the context is probably https://arxiv.org/pdf/1801.06822.pdf ? My understanding is that PKRU is used for lightweight in-process sandboxing, and to extend this sandbox protection to the syscall interface, it is necessary to expose PKRU state to seccomp filters. In other words, this isn't using PKRU exclusively for passing rights into a filter, but it has to use PKRU anyway. Yes, https://arxiv.org/pdf/1801.06822.pdf is indeed the context and what you say is correct. Also, I'm not sure the kernel provides the PKRU guarantees you want at the moment. Our implementation *probably* works, but it's mostly by accident. I don't know, which guarantees about the PKRU are provided at the moment, but the only guarantee needed for this patch is, that the kernel does not change the bits of the PKRU register, which belong to pkeys allocated by the user program, between the syscall entry and the call to secure_computing(). Is there are use case where the kernel would like to modify these bits of the PKRU? -- MIchael
Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data
On Mon, Oct 29, 2018 at 09:25:15AM -0700, Kees Cook wrote: > On Mon, Oct 29, 2018 at 4:23 AM, Michael Sammler wrote: > > Add the current value of an architecture specific protection keys > > register (currently PKRU on x86) to data available for seccomp-bpf > > programs to work on. This allows filters based on the currently > > enabled protection keys. > > > > Support for protection keys on the POWER architecture is not part of > > this patch since I do not have access to a PowerPC, but adding support > > for it can be achieved by setting sd->pkeys to the AMR register in > > populate_seccomp_data. Maybe you can use a generic read_pkey() function, and each arch can provide their own implementation. In the case of powerpc it is mfspr(SPRN_AMR); something like the code below might help? diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h index 92a9962..d79efe3 100644 --- a/arch/powerpc/include/asm/pkeys.h +++ b/arch/powerpc/include/asm/pkeys.h @@ -89,6 +89,11 @@ static inline u16 pte_to_pkey_bits(u64 pteflags) #define __mm_pkey_is_reserved(pkey) (reserved_allocation_mask & \ pkey_alloc_mask(pkey)) +static inline u64 read_pkey() +{ +return mfspr(SPRN_AMR); +} + static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey) { if (pkey < 0 || pkey >= arch_max_pkey()) diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h index 19b137f..feea114 100644 --- a/arch/x86/include/asm/pkeys.h +++ b/arch/x86/include/asm/pkeys.h @@ -9,6 +9,11 @@ extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, unsigned long init_val); +static inline u64 read_pkey() +{ +return read_pkru(); +} + static inline bool arch_pkeys_enabled(void) { return boot_cpu_has(X86_FEATURE_OSPKE); diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h index 2955ba97..f1538c7 100644 --- a/include/linux/pkeys.h +++ b/include/linux/pkeys.h @@ -13,6 +13,11 @@ #define PKEY_DEDICATED_EXECUTE_ONLY 0 #define ARCH_VM_PKEY_FLAGS 0 +static inline u64 read_pkey() +{ + return 0; +} + static inline int vma_pkey(struct vm_area_struct *vma) { return 0; RP > > > > One use case for this patch is disabling unnecessary system calls for a > > library (e.g. network i/o for a crypto library) while the library runs > > without disabling the system calls for the whole program (by changing > > the protection keys before and after the library executes). Using this > > one could ensure that the library behaves a expected (e.g. the crypto > > library not sending private keys to a malicious server). > > > > This patch also enables lightweight sandboxing of untrusted code using > > memory protection keys: Protection keys provide memory isolation but > > for a sandbox system call isolation is needed as well. This patch > > allows writing a seccomp filter to prevent system calls by the > > untrusted code while still allowing system calls for the trusted code. > > Isn't PKU instruction based? Couldn't a malicious library just change > the state of the MPK registers? This seems like an easy way to bypass > any filters that used PKU. I'm not convinced this is a meaningful > barrier that should be enforced by seccomp. > > This can also be done with a signal handler with SECCOMP_RET_TRAP and > check instruction pointer vs PKU state. > > -Kees > > > > > An alternative design would be to extend (c)BPF with a new instruction > > to read the state of a protection key. This alternate design would > > provide a simpler interface to the user space since the BPF program > > would not need to deal with the architecture specific pkeys field in > > seccomp_data, but the question is, how much of an advantage this would > > be as the nr field in seccomp_data is already architecture specific. > > Adding a new instruction for BPF programs is more complicated than > > this patch and might be a breaking change. > > > > Results of selftests/seccomp_benchmark.c on a x86 machine with pkeys > > support: > > > > With patch: > > Benchmarking 33554432 samples... > > 28.019505558 - 18.676858522 = 9342647036 > > getpid native: 278 ns > > 42.279109885 - 28.019657031 = 14259452854 > > getpid RET_ALLOW: 424 ns > > Estimated seccomp overhead per syscall: 146 ns > > > > Without patch: > > Benchmarking 33554432 samples... > > 28.059619466 - 18.706769155 = 9352850311 > > getpid native: 278 ns > > 42.299228279 - 28.059761804 = 14239466475 > > getpid RET_ALLOW: 424 ns > > Estimated seccomp overhead per syscall: 146 ns > > > > Cc: Kees Cook > > Cc: Andy Lutomirski > > Cc: Will Drewry > > Cc: Ram Pai > > Cc: linuxppc-dev@lists.ozlabs.org > > Cc: linux-...@vger.kernel.org > > Signed-off-by: Michael Sammler > > --- > > Changes to the previous version: > > - added motivation, notes about POWER, alternative design and benchmark > > results to the commit log > > - renamed pkru field in seccomp_data to pkeys > > - changed
Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data
On 10/29/18 9:25 AM, Kees Cook wrote: > On Mon, Oct 29, 2018 at 4:23 AM, Michael Sammler wrote: >> Add the current value of an architecture specific protection keys >> register (currently PKRU on x86) to data available for seccomp-bpf >> programs to work on. This allows filters based on the currently >> enabled protection keys. How does the current "assignment" of protection keys to the various uses get communicated to the filter? I'm not sure this is a great use for PKRU. I *think* the basic problem is that you want to communicate some rights information down into a filter, and you want to communicate it with PKRU. While it's handy to have an extra register that nobody (generally) mucks with, I'm not quite convinced that we want to repurpose it this way. Also, I'm not sure the kernel provides the PKRU guarantees you want at the moment. Our implementation *probably* works, but it's mostly by accident.
Re: [RFC PATCH] seccomp: Add protection keys into seccomp_data
On Mon, Oct 29, 2018 at 4:23 AM, Michael Sammler wrote: > Add the current value of an architecture specific protection keys > register (currently PKRU on x86) to data available for seccomp-bpf > programs to work on. This allows filters based on the currently > enabled protection keys. > > Support for protection keys on the POWER architecture is not part of > this patch since I do not have access to a PowerPC, but adding support > for it can be achieved by setting sd->pkeys to the AMR register in > populate_seccomp_data. > > One use case for this patch is disabling unnecessary system calls for a > library (e.g. network i/o for a crypto library) while the library runs > without disabling the system calls for the whole program (by changing > the protection keys before and after the library executes). Using this > one could ensure that the library behaves a expected (e.g. the crypto > library not sending private keys to a malicious server). > > This patch also enables lightweight sandboxing of untrusted code using > memory protection keys: Protection keys provide memory isolation but > for a sandbox system call isolation is needed as well. This patch > allows writing a seccomp filter to prevent system calls by the > untrusted code while still allowing system calls for the trusted code. Isn't PKU instruction based? Couldn't a malicious library just change the state of the MPK registers? This seems like an easy way to bypass any filters that used PKU. I'm not convinced this is a meaningful barrier that should be enforced by seccomp. This can also be done with a signal handler with SECCOMP_RET_TRAP and check instruction pointer vs PKU state. -Kees > > An alternative design would be to extend (c)BPF with a new instruction > to read the state of a protection key. This alternate design would > provide a simpler interface to the user space since the BPF program > would not need to deal with the architecture specific pkeys field in > seccomp_data, but the question is, how much of an advantage this would > be as the nr field in seccomp_data is already architecture specific. > Adding a new instruction for BPF programs is more complicated than > this patch and might be a breaking change. > > Results of selftests/seccomp_benchmark.c on a x86 machine with pkeys > support: > > With patch: > Benchmarking 33554432 samples... > 28.019505558 - 18.676858522 = 9342647036 > getpid native: 278 ns > 42.279109885 - 28.019657031 = 14259452854 > getpid RET_ALLOW: 424 ns > Estimated seccomp overhead per syscall: 146 ns > > Without patch: > Benchmarking 33554432 samples... > 28.059619466 - 18.706769155 = 9352850311 > getpid native: 278 ns > 42.299228279 - 28.059761804 = 14239466475 > getpid RET_ALLOW: 424 ns > Estimated seccomp overhead per syscall: 146 ns > > Cc: Kees Cook > Cc: Andy Lutomirski > Cc: Will Drewry > Cc: Ram Pai > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-...@vger.kernel.org > Signed-off-by: Michael Sammler > --- > Changes to the previous version: > - added motivation, notes about POWER, alternative design and benchmark > results to the commit log > - renamed pkru field in seccomp_data to pkeys > - changed size of pkru field to __u64 and removed reserved field > - added test for x86 > > arch/mips/kernel/ptrace.c | 1 + > arch/x86/entry/common.c | 1 + > include/uapi/linux/seccomp.h | 3 + > kernel/seccomp.c | 1 + > tools/testing/selftests/seccomp/seccomp_bpf.c | 107 > +- > 5 files changed, 112 insertions(+), 1 deletion(-) > > diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c > index e5ba56c0..a58dd04d 100644 > --- a/arch/mips/kernel/ptrace.c > +++ b/arch/mips/kernel/ptrace.c > @@ -1277,6 +1277,7 @@ asmlinkage long syscall_trace_enter(struct pt_regs > *regs, long syscall) > for (i = 0; i < 6; i++) > sd.args[i] = args[i]; > sd.instruction_pointer = KSTK_EIP(current); > + sd.pkeys = 0; > > ret = __secure_computing(); > if (ret == -1) > diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c > index 3b2490b8..20c51bf2 100644 > --- a/arch/x86/entry/common.c > +++ b/arch/x86/entry/common.c > @@ -98,6 +98,7 @@ static long syscall_trace_enter(struct pt_regs *regs) > sd.arch = arch; > sd.nr = regs->orig_ax; > sd.instruction_pointer = regs->ip; > + sd.pkeys = read_pkru(); > #ifdef CONFIG_X86_64 > if (arch == AUDIT_ARCH_X86_64) { > sd.args[0] = regs->di; > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h > index 9efc0e73..3aa2d934 100644 > --- a/include/uapi/linux/seccomp.h > +++ b/include/uapi/linux/seccomp.h > @@ -52,12 +52,15 @@ > * @instruction_pointer: at the time of the system call. > * @args: up to 6 system call arguments always stored