Re: [PATCH 1/1] Sealable memory support
Hi Igor,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc3 next-20170602]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Igor-Stoppa/Sealable-memory-support/20170522-163525
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-n0-06032349 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   mm//smalloc.c: In function 'smalloc_seal_set':
>> mm//smalloc.c:135:4: error: implicit declaration of function 'set_memory_ro' [-Werror=implicit-function-declaration]
       set_memory_ro((unsigned long)node,
       ^
>> mm//smalloc.c:138:4: error: implicit declaration of function 'set_memory_rw' [-Werror=implicit-function-declaration]
       set_memory_rw((unsigned long)node,
       ^
   cc1: some warnings being treated as errors

vim +/set_memory_ro +135 mm//smalloc.c

   129			mutex_unlock(&pool->lock);
   130			return;
   131		}
   132		list_for_each(pos, &pool->list) {
   133		node = list_entry(pos, struct smalloc_node, list);
   134		if (seal == SMALLOC_SEALED)
 > 135			set_memory_ro((unsigned long)node,
   136				      get_node_pages_nr(node));
   137		else if (seal == SMALLOC_UNSEALED)
 > 138			set_memory_rw((unsigned long)node,
   139				      get_node_pages_nr(node));
   140		}
   141		pool->seal = seal;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
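The implicit-declaration errors above usually mean the prototypes for set_memory_ro()/set_memory_rw() were never pulled in: at the time they lived in the arch-specific asm/set_memory.h (a generic linux/set_memory.h wrapper and the ARCH_HAS_SET_MEMORY Kconfig symbol were being introduced around v4.12), and not every architecture provides them at all. A hedged sketch of the kind of fix typically applied — the exact header and guard depend on the tree, and this is an assumption, not the fix Igor actually sent:

```c
/* In mm/smalloc.c: pull in the set_memory_* prototypes.
 * CONFIG_ARCH_HAS_SET_MEMORY exists only in trees where Laura Abbott's
 * set_memory.h series has landed; otherwise the include must stay
 * arch-specific and smalloc needs a Kconfig dependency so it is not
 * built on architectures lacking set_memory_ro()/set_memory_rw(). */
#ifdef CONFIG_ARCH_HAS_SET_MEMORY
#include <linux/set_memory.h>
#else
#include <asm/set_memory.h>	/* x86 location in this era */
#endif
```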
Re: [PATCH 1/1] Sealable memory support
On 28/05/17 21:23, Kees Cook wrote:
> On Wed, May 24, 2017 at 10:45 AM, Igor Stoppa wrote:

[...]

>> If the CPU1 were to forcibly halt anything that can race with it, then
>> it would be sure that there was no interference.
>
> Correct. This is actually what ARM does for doing kernel memory
> writing when poking stuff for kprobes, etc. It's rather dramatic,
> though. :)

ok

>> A reactive approach could be, instead, to re-validate the content after
>> the sealing, assuming that it is possible.
>
> I would prefer to avoid this, as that allows an attacker to still have
> made the changes (which could even result in them then disabling the
> re-validation during the attack).

ok

[...]

>> If you are talking about an attacker, rather than protection against
>> accidental overwrites, how can hashing be enough?
>> Couldn't the attacker compromise that too?
>
> In theory, yes, though the goal was to dedicate a register to the
> hash, which would make it hard/impossible for an attacker to reach.
> (The BPF JIT situation is just an example of this kind of race,
> though. I'm still in favor of reducing the write window to init-time
> from full run-time.)

It looks like some compromise is needed between reliability and
performance ...

[...]

> I would expect other people would NAK using "stop all other CPUs"
> approach. Better to have the CPU-local writes.

ok, that could be a feature for the follow-up?

[...]

> Yeah, I don't have any better idea for names. :)

Still clinging to smalloc for now; how about pmalloc, as "protected malloc"?

> We might be talking past each other. Right now, the LSM is marked with
> __ro_after_init, which will make all the list structures, entries, etc.
> read-only already. There is one case where this is not true, and
> that's for CONFIG_SECURITY_WRITABLE_HOOKS for
> CONFIG_SECURITY_SELINUX_DISABLE, which wants to do run-time removal of
> SELinux.

On this I'll just shut up and provide the patch. It seems easier :-)

> Are you talking about the SELinux policy installed during
> early boot? Which things did you want to use smalloc() on?

The policy DB. And of course the policy cache has to go, as in: it must
be possible to disable it completely, in favor of re-evaluating the
policies when needed.

I'm pretty sure that it is absolutely needed in certain cases, but in
others I think the re-evaluation is negligible in comparison to the
overall duration of the use cases affected.

So it should be ok to let the system integrator gauge speed (if any) vs
resilience, by enabling/disabling the SELinux policy cache.

>>> It seems like smalloc pools could also be refcounted?
>
> I meant things that point into an smalloc() pool could keep a refcount
> and when nothing was pointing into it, it could be destroyed. (i.e.
> using refcount garbage collection instead of explicit destruction.)

ok, but it depends on the level of paranoia that we are talking about.
Counting smalloc vs sfree is easy. Verifying that the free invocations
are actually aligned with previous allocations is a bit more onerous.
Validating that the free is done from the same entity that invoked the
smalloc might be hard.

>> And what for?
>
> It might be easier to reason about later if allocations get complex.
> It's certainly not required for the first version of this.

good :-) more clarifications are needed

[...]

> I don't really have an opinion on this. It might be more readable with
> a named structure?

Eventually I got rid of the macro. The structure is intentionally
unnamed, to stress the grouping without adding a layer of indirection.
But maybe it could be removed. I need to test it.

>> One more thing: how should I tie this allocator to the rest?
>> I have verified that it seems to work with both SLUB and SLAB.
>> Can I make it depend on either of them being enabled?
>
> It seems totally unrelated. The only relationship I see would be
> interaction with hardened usercopy. In a perfect world, none of the
> smalloc pools would be allowed to be copied to/from userspace, which
> would make integration really easy: if smalloc_pool(ptr) return NOPE;
> :P

I went down the harder path and implemented the check.

>> Should it be optionally enabled?
>> What to default to, if it's not enabled? vmalloc?
>
> I don't see any reason to make it optional.

great \o/

So now I'm back to the LSM use case. It shouldn't take too long.

--
thanks, igor
Re: [PATCH 1/1] Sealable memory support
Hi Igor,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc3 next-20170531]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Igor-Stoppa/Sealable-memory-support/20170522-163525
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-v0-05311736 (attached as .config)
compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

All errors (new ones prefixed by >>):

   In file included from mm/smalloc.c:24:
>> mm/smalloc.h:47: error: flexible array member in otherwise empty struct
   mm/smalloc.c: In function 'smalloc_seal_set':
   mm/smalloc.c:135: error: implicit declaration of function 'set_memory_ro'
   mm/smalloc.c:138: error: implicit declaration of function 'set_memory_rw'

vim +47 mm/smalloc.h

    41		struct mutex lock;
    42		enum seal_t seal;
    43	};
    44
    45	struct smalloc_node {
    46		NODE_HEADER;
  > 47		__SMALLOC_ALIGNED__ align_t data[];
    48	};
    49
    50	#define smalloc_seal(pool) \
Re: [kernel-hardening] Re: [PATCH 1/1] Sealable memory support
If I understand the current direction for smalloc, it's to implement it
without the ability to "unseal," which has implications for how LSM
implementations and other users of these dynamic allocations handle
things. If it's implemented without a writeable interface for modules
which need it, then switched to a CoW (or other
controlled/isolated/hardened) mechanism for creating writeable segments
in dynamic allocations, that would create another set of breaking
changes for the consumers adapting to the first set.

In the spirit of adopting things faster/smoother, maybe it makes sense
to permit the smalloc re-writing method suggested in the first
implementation, with an API which consumers can later use for a more
secure approach that reduces the attack surface without breaking
consumer interfaces (or internal semantics). A sysctl affecting the
behavior as ro-only or ro-and-sometimes-rw would give users the
flexibility to tune their environment per their needs, which should
also reduce potential conflict/complaints.

On Sun, May 28, 2017 at 5:32 PM, Kees Cook wrote:
> On Sun, May 28, 2017 at 11:56 AM, Boris Lukashev wrote:
>> So what about a middle ground where CoW semantics are used to enforce
>> the state of these allocations as RO, but provide a strictly
>> controlled pathway to read the RO data, copy and modify it, then write
>> and seal into a new allocation. Successful return from this process
>> should permit the page table to change the pointer to where the object
>> now resides, and initiate freeing of the original memory so long as a
>> refcount is kept for accesses. That way, sealable memory is sealed,
>> and any consumers reading it will be using the original ptr to the
>> original smalloc region. Attackers who do manage to change the
>
> This could be another way to do it, yeah, and it helps that smalloc()
> is built on vmalloc(). It'd require some careful design, but it could
> be a way forward after this initial sealed-after-init version goes in.
>
>> Lastly, my meager understanding is that PaX set the entire kernel as
>> RO, and implemented writeable access via pax_open/close. How were they
>> fighting against race conditions, and what is the benefit of specific
>> regions being allocated this way as opposed to the RO-all-the-things
>> approach which makes writes a specialized set of operations?
>
> My understanding is that PaX's KERNEXEC with the constification plugin
> moves a substantial portion of the kernel's .data section
> (effectively) into the .rodata section. It's not the "entire" kernel.
> (Well, depending on how you count. The .text section is already
> read-only upstream.) PaX, as far as I know, provided no dynamic memory
> allocation protections, like smalloc() would provide.
>
> -Kees
>
> --
> Kees Cook
> Pixel Security

--
Boris Lukashev
Systems Architect
Semper Victus
Re: [kernel-hardening] Re: [PATCH 1/1] Sealable memory support
On Sun, May 28, 2017 at 11:56 AM, Boris Lukashev wrote:
> So what about a middle ground where CoW semantics are used to enforce
> the state of these allocations as RO, but provide a strictly
> controlled pathway to read the RO data, copy and modify it, then write
> and seal into a new allocation. Successful return from this process
> should permit the page table to change the pointer to where the object
> now resides, and initiate freeing of the original memory so long as a
> refcount is kept for accesses. That way, sealable memory is sealed,
> and any consumers reading it will be using the original ptr to the
> original smalloc region. Attackers who do manage to change the

This could be another way to do it, yeah, and it helps that smalloc()
is built on vmalloc(). It'd require some careful design, but it could
be a way forward after this initial sealed-after-init version goes in.

> Lastly, my meager understanding is that PaX set the entire kernel as
> RO, and implemented writeable access via pax_open/close. How were they
> fighting against race conditions, and what is the benefit of specific
> regions being allocated this way as opposed to the RO-all-the-things
> approach which makes writes a specialized set of operations?

My understanding is that PaX's KERNEXEC with the constification plugin
moves a substantial portion of the kernel's .data section
(effectively) into the .rodata section. It's not the "entire" kernel.
(Well, depending on how you count. The .text section is already
read-only upstream.) PaX, as far as I know, provided no dynamic memory
allocation protections, like smalloc() would provide.

-Kees

--
Kees Cook
Pixel Security
Re: [kernel-hardening] Re: [PATCH 1/1] Sealable memory support
One-time sealable memory makes the most sense from a defensive
perspective - red team reads this stuff, the races mentioned will be
implemented as described to win the day, and probably in other
innovative ways. If a gap is left in the implementation, without
explicit coverage by an adjacent function, it will be used no matter
how small the chances of it occurring in the real world are - grooming
systems to create unlikely conditions is fair play (look at
eternalblue's SMB pool machinations).

However, out-of-tree modules will likely not appreciate this - third
party LSMs, the tpe module, SCST, etc. I don't want to get into the
"NIH" debate; they're real functional components, used by real
companies, often enough that they need to be considered a member of the
ecosystem, even if not a first-order member.

So what about a middle ground where CoW semantics are used to enforce
the state of these allocations as RO, but provide a strictly controlled
pathway to read the RO data, copy and modify it, then write and seal
into a new allocation. Successful return from this process should
permit the page table to change the pointer to where the object now
resides, and initiate freeing of the original memory so long as a
refcount is kept for accesses. That way, sealable memory is sealed, and
any consumers reading it will be using the original ptr to the original
smalloc region. Attackers who do manage to change the allocation by
writing a new one still have to figure out how to change the region
used by an existing consumer. New consumers will get hit by whatever
they changed, but the change will be tracked and require them to gain
execution control over the allocator itself in order to effect the
change (or ROP-chain something else into doing it, but that's a
discussion on RAP/CFI).

CPU-local ops are great, if they don't halt the other cores. Stopping
all other CPUs is going to be DoA in HPC and other CPU-intensive
workloads - think what ZFS would do if its pipelines kept getting
halted by something running a lot of smallocs (they get non-preemptible
often enough, requiring waits on both sides of the op), or how LIO
would behave - iSCSI waits for no man or woman, or their allocator
strategies. I'm all for "security should be more of a concern than
performance - it's easier to build a faster car later if you're still
alive to do it," but keeping in mind who the consumers are, I can
easily see this functionality staying disabled in most distributions
and thus receiving much less testing and beatdown than it should.

Lastly, my meager understanding is that PaX set the entire kernel as
RO, and implemented writeable access via pax_open/close. How were they
fighting against race conditions, and what is the benefit of specific
regions being allocated this way as opposed to the RO-all-the-things
approach which makes writes a specialized set of operations?

On Sun, May 28, 2017 at 2:23 PM, Kees Cook wrote:
> On Wed, May 24, 2017 at 10:45 AM, Igor Stoppa wrote:
>> On 23/05/17 23:11, Kees Cook wrote:
>>> On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa wrote:
>>> I meant this:
>>>
>>>    CPU 1      CPU 2
>>>    create
>>>    alloc
>>>    write
>>>    seal
>>>    ...
>>>    unseal
>>>    write
>>>               write
>>>    seal
>>>
>>> The CPU 2 write would be, for example, an attacker using a
>>> vulnerability to attempt to write to memory in the sealed area. All it
>>> would need to do to succeed would be to trigger an action in the
>>> kernel that would do a "legitimate" write (which requires the unseal),
>>> and race it. Unsealing should be CPU-local, if the API is going to
>>> support this kind of access.
>>
>> I see.
>> If the CPU1 were to forcibly halt anything that can race with it, then
>> it would be sure that there was no interference.
>
> Correct. This is actually what ARM does for doing kernel memory
> writing when poking stuff for kprobes, etc. It's rather dramatic,
> though. :)
>
>> A reactive approach could be, instead, to re-validate the content after
>> the sealing, assuming that it is possible.
>
> I would prefer to avoid this, as that allows an attacker to still have
> made the changes (which could even result in them then disabling the
> re-validation during the attack).
>
>>> I am more concerned about _any_ unseal after initial seal. And even
>>> then, it'd be nice to keep things CPU-local. My concerns are related
>>> to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704)
>>> which is kind of like this, but focused on the .data section, not
>>> dynamic memory. It has similar concerns about CPU-locality.
>>> Additionally, even writing to memory and then making it read-only
>>> later runs risks (see threads about BPF JIT races vs making things
>>> read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK
>>> doesn't change the risk this series is fixing: races with attacker
>>> writes during assignment but before read-only marking).
>>
>> If you are
Re: [kernel-hardening] Re: [PATCH 1/1] Sealable memory support
One-time sealable memory makes the most sense from a defensive perspective - red team reads this stuff, the races mentioned will be implemented as described to win the day, and probably in other innovative ways. If a gap is left in the implementation, without explicit coverage by an adjacent function, it will be used no matter how small the chances of it occurring in the real world are - grooming systems to create unlikely conditions is fair play (look at eternalblue's SMB pool machinations). However, out of tree modules will likely not appreciate this - third party LSMs, tpe module, SCST, etc. I dont want to get into the "NIH" debate, they're real functional components, used by real companies, often enough that they need to be considered a member of the ecosystem, even if not a first-order member. So what about a middle ground where CoW semantics are used to enforce the state of these allocations as RO, but provide a strictly controlled pathway to read the RO data, copy and modify it, then write and seal into a new allocation. Successful return from this process should permit the page table to change the pointer to where the object now resides, and initiate freeing of the original memory so long as a refcount is kept for accesses. That way, sealable memory is sealed, and any consumers reading it will be using the original ptr to the original smalloc region. Attackers who do manage to change the allocation by writing a new one still have to figure out how to change the region used by an existing consumer. New consumers will get hit by whatever they changed, but the change will be tracked and require them to gain execution control over the allocator itself in order to affect the change (or ROP chain something else into doing it, but thats's a discussion on RAP/CFI). CPU-local ops are great, if they dont halt the other cores. 
Stopping all other CPUs is going to be DoA in HPC and other CPU intensive workloads - think what ZFS would do if its pipelines kept getting halted by something running a lot of smallocs (they get non-preemptible often enough, requiring waits on both sides of the op), how how LIO would behave - iSCSI waits for no man or woman, or their allocator strategies. I'm all for "security should be more of a concern than performance - its easier to build a faster car later if you're still alive to do it," but keeping in mind who the consumers are, i can easily see this functionality staying disabled in most distributions and thus receiving much less testing and beatdown than it should. Lastly, my meager understanding is that PAX set the entire kernel as RO, and implemented writeable access via pax_open/close. How were they fighting against race conditions, and what is the benefit of specific regions being allocated this way as opposed to the RO-all-the-things approach which makes writes a specialized set of operations? On Sun, May 28, 2017 at 2:23 PM, Kees Cook wrote: > On Wed, May 24, 2017 at 10:45 AM, Igor Stoppa wrote: >> On 23/05/17 23:11, Kees Cook wrote: >>> On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa wrote: >>> I meant this: >>> >>> CPU 1 CPU 2 >>> create >>> alloc >>> write >>> seal >>> ... >>> unseal >>> write >>> write >>> seal >>> >>> The CPU 2 write would be, for example, an attacker using a >>> vulnerability to attempt to write to memory in the sealed area. All it >>> would need to do to succeed would be to trigger an action in the >>> kernel that would do a "legitimate" write (which requires the unseal), >>> and race it. Unsealing should be CPU-local, if the API is going to >>> support this kind of access. >> >> I see. >> If the CPU1 were to forcibly halt anything that can race with it, then >> it would be sure that there was no interference. > > Correct. This is actually what ARM does for doing kernel memory > writing when poking stuff for kprobes, etc. 
It's rather dramatic, > though. :) > >> A reactive approach could be, instead, to re-validate the content after >> the sealing, assuming that it is possible. > > I would prefer to avoid this, as that allows an attacker to still have > made the changes (which could even result in them then disabling the > re-validation during the attack). > >>> I am more concerned about _any_ unseal after initial seal. And even >>> then, it'd be nice to keep things CPU-local. My concerns are related >>> to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704) >>> which is kind of like this, but focused on the .data section, not >>> dynamic memory. It has similar concerns about CPU-locality. >>> Additionally, even writing to memory and then making it read-only >>> later runs risks (see threads about BPF JIT races vs making things >>> read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK >>> doesn't change the risk this series is fixing: races with attacker >>> writes during assignment but before read-only marking). >> >> If you are talking about an attacker, rather than protection against >> accidental overwrites, how hashing can be enough? >> Couldn't the attacker compromise that too?
Re: [PATCH 1/1] Sealable memory support
On Wed, May 24, 2017 at 10:45 AM, Igor Stoppa wrote: > On 23/05/17 23:11, Kees Cook wrote: >> On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa wrote: >> I meant this: >> >> CPU 1 CPU 2 >> create >> alloc >> write >> seal >> ... >> unseal >> write >> write >> seal >> >> The CPU 2 write would be, for example, an attacker using a >> vulnerability to attempt to write to memory in the sealed area. All it >> would need to do to succeed would be to trigger an action in the >> kernel that would do a "legitimate" write (which requires the unseal), >> and race it. Unsealing should be CPU-local, if the API is going to >> support this kind of access. > > I see. > If the CPU1 were to forcibly halt anything that can race with it, then > it would be sure that there was no interference. Correct. This is actually what ARM does for doing kernel memory writing when poking stuff for kprobes, etc. It's rather dramatic, though. :) > A reactive approach could be, instead, to re-validate the content after > the sealing, assuming that it is possible. I would prefer to avoid this, as that allows an attacker to still have made the changes (which could even result in them then disabling the re-validation during the attack). >> I am more concerned about _any_ unseal after initial seal. And even >> then, it'd be nice to keep things CPU-local. My concerns are related >> to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704) >> which is kind of like this, but focused on the .data section, not >> dynamic memory. It has similar concerns about CPU-locality. >> Additionally, even writing to memory and then making it read-only >> later runs risks (see threads about BPF JIT races vs making things >> read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK >> doesn't change the risk this series is fixing: races with attacker >> writes during assignment but before read-only marking). 
> > If you are talking about an attacker, rather than protection against > accidental overwrites, how hashing can be enough? > Couldn't the attacker compromise that too? In theory, yes, though the goal was to dedicate a register to the hash, which would make it hard/impossible for an attacker to reach. (The BPF JIT situation is just an example of this kind of race, though. I'm still in favor of reducing the write window to init-time from full run-time.) >> So, while smalloc would hugely reduce the window an attacker has >> available to change data contents, this API doesn't eliminate it. (To >> eliminate it, there would need to be a CPU-local page permission view >> that let only the current CPU to the page, and then restore it to >> read-only to match the global read-only view.) > > That or, if one is ready to take the hit, freeze every other possible > attack vector. But I'm not sure this could be justifiable. I would expect other people would NAK using "stop all other CPUs" approach. Better to have the CPU-local writes. >> Ah! In that case, sure. This isn't what the proposed API provided, >> though, so let's adjust it to only perform the unseal at destroy time. >> That makes it much saner, IMO. "Write once" dynamic allocations, or >> "read-only after seal". woalloc? :P yay naming > > For now I'm still using smalloc. > Anything that is either [x]malloc or [yz]malloc is fine, lengthwise. > Other options might require some re-formatting. Yeah, I don't have any better idea for names. :) >> Ah, okay. Most of the LSM is happily covered by __ro_after_init. If we >> could just drop the runtime disabling of SELinux, we'd be fine. > > I am not sure I understand this point. > If the kernel is properly configured, the master toggle variable > disappears, right? > Or do you mean the disabling through modifications of the linked list of > the hooks? We might be talking past each-other. 
Right now, the LSM is marked with __ro_after_init, which will make all the list structures, entries, etc read-only already. There is one case where this is not true, and that's for CONFIG_SECURITY_WRITABLE_HOOKS for CONFIG_SECURITY_SELINUX_DISABLE, which wants to do run-time removal of SELinux. Are you talking about the SELinux policy installed during early boot? Which things did you want to use smalloc() on? >> It seems like smalloc pools could also be refcounted? > > I am not sure what you mean. > What do you want to count? > Number of pools? Nodes per pool? Allocations per node? I meant things that point into an smalloc() pool could keep a refcount and when nothing was pointing into it, it could be destroyed. (i.e. using refcount garbage collection instead of explicit destruction.) > And what for? It might be easier to reason about later if allocations get complex. It's certainly not required for the first version of this. > At least in the case of tearing down a pool, when a module is unloaded, > nobody needs to free anything that was allocated with smalloc. > The teardown function will free the pages from each node. Right, yeah.
Re: [PATCH 1/1] Sealable memory support
On 23/05/17 23:11, Kees Cook wrote: > On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa wrote: [...] > I would want hardened usercopy support as a requirement for using > smalloc(). Without it, we're regressing the over-read protection that > already exists for slab objects, if kernel code switched from slab to > smalloc. It should be very similar to the existing slab checks. "Is > this a smalloc object? Have we read beyond the end of a given object?" > etc. The metadata is all there, except for an efficient way to mark a > page as a smalloc page, but I think that just requires a new Page$Name > bit test, as done for slab. ok [...] > I meant this: > > CPU 1 CPU 2 > create > alloc > write > seal > ... > unseal > write > write > seal > > The CPU 2 write would be, for example, an attacker using a > vulnerability to attempt to write to memory in the sealed area. All it > would need to do to succeed would be to trigger an action in the > kernel that would do a "legitimate" write (which requires the unseal), > and race it. Unsealing should be CPU-local, if the API is going to > support this kind of access. I see. If the CPU1 were to forcibly halt anything that can race with it, then it would be sure that there was no interference. A reactive approach could be, instead, to re-validate the content after the sealing, assuming that it is possible. [...] > I am more concerned about _any_ unseal after initial seal. And even > then, it'd be nice to keep things CPU-local. My concerns are related > to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704) > which is kind of like this, but focused on the .data section, not > dynamic memory. It has similar concerns about CPU-locality. 
> Additionally, even writing to memory and then making it read-only > later runs risks (see threads about BPF JIT races vs making things > read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK > doesn't change the risk this series is fixing: races with attacker > writes during assignment but before read-only marking). If you are talking about an attacker, rather than protection against accidental overwrites, how hashing can be enough? Couldn't the attacker compromise that too? > So, while smalloc would hugely reduce the window an attacker has > available to change data contents, this API doesn't eliminate it. (To > eliminate it, there would need to be a CPU-local page permission view > that let only the current CPU to the page, and then restore it to > read-only to match the global read-only view.) That or, if one is ready to take the hit, freeze every other possible attack vector. But I'm not sure this could be justifiable. [...] > Ah! In that case, sure. This isn't what the proposed API provided, > though, so let's adjust it to only perform the unseal at destroy time. > That makes it much saner, IMO. "Write once" dynamic allocations, or > "read-only after seal". woalloc? :P yay naming For now I'm still using smalloc. Anything that is either [x]malloc or [yz]malloc is fine, lengthwise. Other options might require some re-formatting. [...] > I think a shared global pool would need to be eliminated for true > write-once semantics. ok [...] >> I'd rather not add extra locking to something that doesn't need it: >> Allocate - write - seal - read, read, read, ... - unseal - destroy. > > Yup, I would be totally fine with this. It still has a race between > allocate and seal, but it's a huge improvement over the existing state > of the world where all dynamic memory is writable. :) Great! [...] > Ah, okay. Most of the LSM is happily covered by __ro_after_init. If we > could just drop the runtime disabling of SELinux, we'd be fine. 
I am not sure I understand this point. If the kernel is properly configured, the master toggle variable disappears, right? Or do you mean the disabling through modifications of the linked list of the hooks? [...] > Hm, I just meant add a char[] to the metadata and pass it in during > create(). Then it's possible to report which smalloc cache is being > examined during hardened usercopy checks. Ok, that is not a big deal. wrt this, I have spent some time writing a debug module, which currently dumps into a debugfs entry a bunch of info about the various pools. I could split it across multiple entries, using the label to generate their names. [...] > It seems like smalloc pools could also be refcounted? I am not sure what you mean. What do you want to count? Number of pools? Nodes per pool? Allocations per node? And what for? At least in the case of tearing down a pool, when a module is unloaded, nobody needs to free anything that was allocated with smalloc. The teardown function will free the pages from each node. Is this the place where you think there should be a check on the number of pages freed? [...] +#define NODE_HEADER\ + struct {
Re: [PATCH 1/1] Sealable memory support
On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa wrote: > On 23/05/17 00:38, Kees Cook wrote: >> On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa wrote: > > [...] > >> For the first bit of bikeshedding, should this really be called >> seal/unseal? My mind is probably just broken from having read TPM >> documentation, but this isn't really "sealing" as I'd understand it >> (it's not tied to a credential, for example). It's "only" rw/ro. >> Perhaps "protect/unprotect" or just simply "readonly/writable", and >> call the base function "romalloc"? > > I was not aware of the specific meaning of "seal", in this context. > The term was implicitly proposed by Michal Hocko, while discussing > the mechanism and I liked it more than what I was using initially: > "lockable". > > tbh I like the sound of "smalloc" better than "romalloc" > > But this is really the least of my worries :-P > >> This is fundamentally a heap allocator, with linked lists, etc. I'd >> like to see as much attention as possible given to hardening it >> against attacks, especially adding redzoning around the metadata at >> least, and perhaps requiring that CONFIG_DEBUG_LIST be enabled. > > My initial goal was to provide something that is useful without > affecting performance. > > You seem to be pushing for a more extreme approach. I don't know about pushing, but it seemed like the API provided for arbitrary unsealing, etc. More below... > While I have nothing against it and I actually agree that it can be > useful, I would not make it mandatory. > > More on this later. > >> And as >> part of that, I'd like hardened usercopy to grow knowledge of these >> allocations so we can bounds-check objects. Right now, mm/usercopy.c >> just looks at PageSlab(page) to decide if it should do slab checks. I >> think adding a check for this type of object would be very important >> there. 
> > I am not familiar with this and I need to study it, however I still > think that if there is a significant trade-off in terms of performance > vs resilience, it should be optional, for those who want it. I would want hardened usercopy support as a requirement for using smalloc(). Without it, we're regressing the over-read protection that already exists for slab objects, if kernel code switched from slab to smalloc. It should be very similar to the existing slab checks. "Is this a smalloc object? Have we read beyond the end of a given object?" etc. The metadata is all there, except for an efficient way to mark a page as a smalloc page, but I think that just requires a new Page$Name bit test, as done for slab. > Maybe there could be a master toggle for these options, if it makes > sense to group them logically. How does this sound? > >> The ro/rw granularity here is the _entire_ pool, not a specific >> allocation (or page containing the allocation). I'm concerned that >> makes this very open to race conditions where, especially in the >> global pool, one thread can be trying to write to ro data in a pool >> and another has made the pool writable. > > I have the impression we are thinking to something different. > Close, but different enough. > > First of all, how using a mutex can create races? > Do you mean with respect of other resources that might be held by > competing users of the pool? I meant this: CPU 1 CPU 2 create alloc write seal ... unseal write write seal The CPU 2 write would be, for example, an attacker using a vulnerability to attempt to write to memory in the sealed area. All it would need to do to succeed would be to trigger an action in the kernel that would do a "legitimate" write (which requires the unseal), and race it. Unsealing should be CPU-local, if the API is going to support this kind of access. > That, imho, is a locking problem that cannot be solved here. 
> You can try to mitigate it, by reducing the chances it will happen, but > basically you are trying to make do with a user of the API that is not > implementing locking correctly. > I'd say that it's better to fix the user. > > If you meant something else, I really didn't get it :-) > > More about the frequency of the access: you seem to expect very often > seal/unseal - or lock/unlock, while I don't. I am more concerned about _any_ unseal after initial seal. And even then, it'd be nice to keep things CPU-local. My concerns are related to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704) which is kind of like this, but focused on the .data section, not dynamic memory. It has similar concerns about CPU-locality. Additionally, even writing to memory and then making it read-only later runs risks (see threads about BPF JIT races vs making things read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK doesn't change the risk this series is fixing: races with attacker writes during assignment but before read-only marking). So, while smalloc would hugely reduce the window an attacker has available to change data contents, this API doesn't eliminate it.
Re: [PATCH 1/1] Sealable memory support
On 23/05/17 00:38, Kees Cook wrote: > On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa wrote: [...] > For the first bit of bikeshedding, should this really be called > seal/unseal? My mind is probably just broken from having read TPM > documentation, but this isn't really "sealing" as I'd understand it > (it's not tied to a credential, for example). It's "only" rw/ro. > Perhaps "protect/unprotect" or just simply "readonly/writable", and > call the base function "romalloc"? I was not aware of the specific meaning of "seal", in this context. The term was implicitly proposed by Michal Hocko, while discussing the mechanism and I liked it more than what I was using initially: "lockable". tbh I like the sound of "smalloc" better than "romalloc" But this is really the least of my worries :-P > This is fundamentally a heap allocator, with linked lists, etc. I'd > like to see as much attention as possible given to hardening it > against attacks, especially adding redzoning around the metadata at > least, and perhaps requiring that CONFIG_DEBUG_LIST be enabled. My initial goal was to provide something that is useful without affecting performance. You seem to be pushing for a more extreme approach. While I have nothing against it and I actually agree that it can be useful, I would not make it mandatory. More on this later. > And as > part of that, I'd like hardened usercopy to grow knowledge of these > allocations so we can bounds-check objects. Right now, mm/usercopy.c > just looks at PageSlab(page) to decide if it should do slab checks. I > think adding a check for this type of object would be very important > there. I am not familiar with this and I need to study it, however I still think that if there is a significant trade-off in terms of performance vs resilience, it should be optional, for those who want it. Maybe there could be a master toggle for these options, if it makes sense to group them logically. How does this sound? 
> The ro/rw granularity here is the _entire_ pool, not a specific
> allocation (or page containing the allocation). I'm concerned that
> makes this very open to race conditions where, especially in the
> global pool, one thread can be trying to write to ro data in a pool
> and another has made the pool writable.

I have the impression we are thinking of something different. Close,
but different enough.

First of all, how can using a mutex create races? Do you mean with
respect to other resources that might be held by competing users of the
pool? That, imho, is a locking problem that cannot be solved here.
You can try to mitigate it, by reducing the chances it will happen, but
basically you are trying to make do with a user of the API that is not
implementing locking correctly. I'd say that it's better to fix the
user.

If you meant something else, I really didn't get it :-)

More about the frequency of the access: you seem to expect very often
seal/unseal - or lock/unlock, while I don't. What I envision as the
primary case for unlocking is the tear-down of the entire pool, for
example preceding the unloading of the module. Think about
__ro_after_init: once it is r/o, it stays r/o.

Which also means that there would possibly be a busy transient, when
allocations are made. That is true. But only for shared pools. If a
module uses one pool, I would expect the initialization of the module
to be mostly sequential. So, no chance of races. Do you agree?

Furthermore, if a module _really_ wants to do parallel allocation, then
maybe it's simpler and cleaner to have one pool per "thread", or
whatever the mechanism used is.

The global pool is mostly there to complete the offering of
__ro_after_init. If one wants to use it, it's available. It saves the
trouble of having one's own pool, but it means competing with the rest
of the kernel for locking. If it seems a bad idea, it can be easily
removed.

[...]

>> +#include "smalloc.h"
>
> Shouldn't this just be <smalloc.h>?
Yes, I have fixed it in the next revision.

[...]

>> +	/* If no pool specified, use the global one. */
>> +	if (!pool)
>> +		pool = global_pool;
>
> It should be impossible, but this should check for global_pool == NULL
> too, IMO.

I'm curious: why do you think it could happen? I'd rather throw an
exception, if I cannot initialize global_pool.

[...]

>> +	if (pool->seal == seal) {
>> +		mutex_unlock(&pool->lock);
>> +		return;
>
> I actually think this should be a BUG condition, since this means a
> mismatched seal/unseal happened. The pool should never be left
> writable at rest, a user should create a pool, write to it, seal.

On this I agree.

> Any updates should unseal, write, seal. To attempt an unseal and find
> it already unsealed seems bad.

But what you are describing seems exactly the case where there could be
a race. Ex: 2 writers try to unseal and add/change something. They
could have a collision both when trying to seal and unseal. One could
argue that there
Re: [PATCH 1/1] Sealable memory support
On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa wrote:
> Dynamically allocated variables can be made read only,
> after they have been initialized, provided that they reside in memory
> pages devoid of any RW data.
>
> The implementation supplies means to create independent pools of
> memory, which can be individually created, sealed/unsealed and
> destroyed.

This would be a welcome addition, thanks for posting it! I have a bunch
of feedback, here and below:

For the first bit of bikeshedding, should this really be called
seal/unseal? My mind is probably just broken from having read TPM
documentation, but this isn't really "sealing" as I'd understand it
(it's not tied to a credential, for example). It's "only" rw/ro.
Perhaps "protect/unprotect" or just simply "readonly/writable", and
call the base function "romalloc"?

This is fundamentally a heap allocator, with linked lists, etc. I'd
like to see as much attention as possible given to hardening it against
attacks, especially adding redzoning around the metadata at least, and
perhaps requiring that CONFIG_DEBUG_LIST be enabled. And as part of
that, I'd like hardened usercopy to grow knowledge of these allocations
so we can bounds-check objects. Right now, mm/usercopy.c just looks at
PageSlab(page) to decide if it should do slab checks. I think adding a
check for this type of object would be very important there.

The ro/rw granularity here is the _entire_ pool, not a specific
allocation (or page containing the allocation). I'm concerned that
makes this very open to race conditions where, especially in the global
pool, one thread can be trying to write to ro data in a pool and
another has made the pool writable.

> A global pool is made available for those kernel modules that do not
> need to manage an independent pool.
>
> Signed-off-by: Igor Stoppa
> ---
>  mm/Makefile  |   2 +-
>  mm/smalloc.c | 200 +++
>  mm/smalloc.h |  61 ++
>  3 files changed, 262 insertions(+), 1 deletion(-)
>  create mode 100644 mm/smalloc.c
>  create mode 100644 mm/smalloc.h
>
> diff --git a/mm/Makefile b/mm/Makefile
> index 026f6a8..737c42a 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -39,7 +39,7 @@ obj-y := filemap.o mempool.o oom_kill.o \
>  	mm_init.o mmu_context.o percpu.o slab_common.o \
>  	compaction.o vmacache.o swap_slots.o \
>  	interval_tree.o list_lru.o workingset.o \
> -	debug.o $(mmu-y)
> +	debug.o smalloc.o $(mmu-y)
>
>  obj-y += init-mm.o
>
> diff --git a/mm/smalloc.c b/mm/smalloc.c
> new file mode 100644
> index 000..fa04cc5
> --- /dev/null
> +++ b/mm/smalloc.c
> @@ -0,0 +1,200 @@
> +/*
> + * smalloc.c: Sealable Memory Allocator
> + *
> + * (C) Copyright 2017 Huawei Technologies Co. Ltd.
> + * Author: Igor Stoppa
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; version 2
> + * of the License.
> + */
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +
> +#include
> +#include
> +#include "smalloc.h"

Shouldn't this just be <smalloc.h>?

> +#define page_roundup(size) (((size) + !(size) - 1 + PAGE_SIZE) & PAGE_MASK)
> +
> +#define pages_nr(size) (page_roundup(size) / PAGE_SIZE)
> +
> +static struct smalloc_pool *global_pool;
> +
> +struct smalloc_node *__smalloc_create_node(unsigned long words)
> +{
> +	struct smalloc_node *node;
> +	unsigned long size;
> +
> +	/* Calculate the size to ask from vmalloc, page aligned.
> +	 */
> +	size = page_roundup(NODE_HEADER_SIZE + words * sizeof(align_t));
> +	node = vmalloc(size);
> +	if (!node) {
> +		pr_err("No memory for allocating smalloc node.");
> +		return NULL;
> +	}
> +	/* Initialize the node. */
> +	INIT_LIST_HEAD(&node->list);
> +	node->free = node->data;
> +	node->available_words = (size - NODE_HEADER_SIZE) / sizeof(align_t);
> +	return node;
> +}
> +
> +static __always_inline
> +void *node_alloc(struct smalloc_node *node, unsigned long words)
> +{
> +	register align_t *old_free = node->free;
> +
> +	node->available_words -= words;
> +	node->free += words;
> +	return old_free;
> +}
> +
> +void *smalloc(unsigned long size, struct smalloc_pool *pool)
> +{
> +	struct list_head *pos;
> +	struct smalloc_node *node;
> +	void *ptr;
> +	unsigned long words;
> +
> +	/* If no pool specified, use the global one. */
> +	if (!pool)
> +		pool = global_pool;

It should be impossible, but this should check for global_pool == NULL
too, IMO.

> +	mutex_lock(&pool->lock);
> +
> +	/* If the pool is sealed, then return NULL. */
> +	if (pool->seal == SMALLOC_SEALED)