Re: [PATCH 1/1] Sealable memory support

2017-06-03 Thread kbuild test robot
Hi Igor,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc3 next-20170602]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Igor-Stoppa/Sealable-memory-support/20170522-163525
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-n0-06032349 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   mm//smalloc.c: In function 'smalloc_seal_set':
>> mm//smalloc.c:135:4: error: implicit declaration of function 'set_memory_ro' 
>> [-Werror=implicit-function-declaration]
   set_memory_ro((unsigned long)node,
   ^
>> mm//smalloc.c:138:4: error: implicit declaration of function 'set_memory_rw' 
>> [-Werror=implicit-function-declaration]
   set_memory_rw((unsigned long)node,
   ^
   cc1: some warnings being treated as errors

vim +/set_memory_ro +135 mm//smalloc.c

   129                  mutex_unlock(&pool->lock);
   130                  return;
   131          }
   132          list_for_each(pos, &pool->list) {
   133                  node = list_entry(pos, struct smalloc_node, list);
   134                  if (seal == SMALLOC_SEALED)
 > 135                          set_memory_ro((unsigned long)node,
   136                                        get_node_pages_nr(node));
   137                  else if (seal == SMALLOC_UNSEALED)
 > 138                          set_memory_rw((unsigned long)node,
   139                                        get_node_pages_nr(node));
   140          }
   141          pool->seal = seal;
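
The implicit declarations suggest mm/smalloc.c never includes the header that
declares set_memory_ro()/set_memory_rw(), and that nothing prevents building
the allocator on configurations that lack them. A minimal sketch of one
possible fix, assuming the v4.12-era <linux/set_memory.h> wrapper and the
ARCH_HAS_SET_MEMORY Kconfig symbol are available in the base tree; the
SMALLOC config name is illustrative, not taken from the posted patch:

    /* mm/smalloc.c: pull in the declarations explicitly */
    #include <linux/set_memory.h>   /* set_memory_ro(), set_memory_rw() */

    # mm/Kconfig: only offer the allocator where page permissions can be changed
    config SMALLOC
            bool "Sealable memory allocator"
            depends on MMU
            depends on ARCH_HAS_SET_MEMORY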

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 1/1] Sealable memory support

2017-05-31 Thread Igor Stoppa
On 28/05/17 21:23, Kees Cook wrote:
> On Wed, May 24, 2017 at 10:45 AM, Igor Stoppa  wrote:

[...]

>> If the CPU1 were to forcibly halt anything that can race with it, then
>> it would be sure that there was no interference.
> 
> Correct. This is actually what ARM does for doing kernel memory
> writing when poking stuff for kprobes, etc. It's rather dramatic,
> though. :)

ok

>> A reactive approach could be, instead, to re-validate the content after
>> the sealing, assuming that it is possible.
> 
> I would prefer to avoid this, as that allows an attacker to still have
> made the changes (which could even result in them then disabling the
> re-validation during the attack).

ok


[...]

>> If you are talking about an attacker, rather than protection against
>> accidental overwrites, how can hashing be enough?
>> Couldn't the attacker compromise that too?
> 
> In theory, yes, though the goal was to dedicate a register to the
> hash, which would make it hard/impossible for an attacker to reach.
> (The BPF JIT situation is just an example of this kind of race,
> though. I'm still in favor of reducing the write window to init-time
> from full run-time.)

It looks like some compromise is needed between reliability and
performance ...

[...]

> I would expect other people would NAK using the "stop all other CPUs"
> approach. Better to have the CPU-local writes.

ok, that could be a feature for the follow-up?

[...]

> Yeah, I don't have any better idea for names. :)

still clinging to smalloc for now, how about pmalloc,
as "protected malloc"?


> We might be talking past each-other. Right now, the LSM is marked with
> __ro_after_init, which will make all the list structures, entries, etc
> read-only already. There is one case where this is not true, and
> that's for CONFIG_SECURITY_WRITABLE_HOOKS for
> CONFIG_SECURITY_SELINUX_DISABLE, which wants to do run-time removal of
> SELinux. 

On this I'll just shut up and provide the patch. It seems easier :-)

> Are you talking about the SELinux policy installed during
> early boot? Which things did you want to use smalloc() on?

The policy DB.
And of course the policy cache has to go, as in: it must be possible to
disable it completely, in favor of re-evaluating the policies, when needed.

I'm pretty sure that the cache is absolutely needed in certain cases, but in
others I think the re-evaluation is negligible in comparison to the
overall duration of the use cases affected.

So it should be OK to let the system integrator weigh the speed gain (if
any) against resilience, by enabling/disabling the SELinux policy cache.


>>> It seems like smalloc pools could also be refcounted?
> 
> I meant things that point into an smalloc() pool could keep a refcount
> and when nothing was pointing into it, it could be destroyed. (i.e.
> using refcount garbage collection instead of explicit destruction.)

ok, but it depends on the level of paranoia that we are talking about.

Counting smalloc vs sfree is easy.

Verifying that the free invocations are actually aligned with previous
allocations is a bit more onerous.

Validating that the free is done from the same entity that invoked the
smalloc might be hard.

>> And what for?
> 
> It might be easier to reason about later if allocations get complex.
> It's certainly not required for the first version of this.

good :-)
more clarifications are needed

[...]

> I don't really have an opinion on this. It might be more readable with
> a named structure?

Eventually I got rid of the macro.
The structure is intentionally unnamed to stress the grouping, without
adding a layer of indirection. But maybe it could be removed. I need to
test it.

>> One more thing: how should I tie this allocator to the rest?
>> I have verified that it seems to work with both SLUB and SLAB.
>> Can I make it depend on either of them being enabled?
> 
> It seems totally unrelated. The only relationship I see would be
> interaction with hardened usercopy. In a perfect world, none of the
> smalloc pools would be allowed to be copied to/from userspace, which
> would make integration really easy: if smalloc_pool(ptr) return NOPE;
> :P

I went down the harder path and implemented the check.
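
For illustration, a minimal sketch of the simpler option quoted above ("if
smalloc_pool(ptr) return NOPE"), i.e. rejecting any usercopy that touches a
sealable pool. smalloc_pool() is a hypothetical lookup helper, and the string
return mimics how the v4.12-era mm/usercopy.c treats a non-NULL result from
its heap checks as a violation to report; this is not the check actually
implemented in the patch:

    static const char *check_smalloc_object(const void *ptr)
    {
            if (smalloc_pool(ptr))          /* does ptr land in a sealable pool? */
                    return "<smalloc>";     /* never a valid usercopy target/source */
            return NULL;                    /* not ours; let the other checks run */
    }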

>> Should it be optionally enabled?
>> What to default to, if it's not enabled? vmalloc?
> 
> I don't see any reason to make it optional.

great \o/

So now I'm back to the LSM use case.
It shouldn't take too long.

--
thanks, igor


Re: [PATCH 1/1] Sealable memory support

2017-05-31 Thread kbuild test robot
Hi Igor,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc3 next-20170531]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Igor-Stoppa/Sealable-memory-support/20170522-163525
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-v0-05311736 (attached as .config)
compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   In file included from mm/smalloc.c:24:
>> mm/smalloc.h:47: error: flexible array member in otherwise empty struct
   mm/smalloc.c: In function 'smalloc_seal_set':
   mm/smalloc.c:135: error: implicit declaration of function 'set_memory_ro'
   mm/smalloc.c:138: error: implicit declaration of function 'set_memory_rw'

vim +47 mm/smalloc.h

41  struct mutex lock;
42  enum seal_t seal;
43  };
44  
45  struct smalloc_node {
46  NODE_HEADER;
  > 47  __SMALLOC_ALIGNED__ align_t data[];
48  };
49  
50  #define smalloc_seal(pool) \
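
gcc 4.4 pre-dates C11 anonymous structs, so if NODE_HEADER expands to an
unnamed "struct { ... };" the compiler sees no named member ahead of the
flexible array and rejects the declaration. A minimal sketch of one possible
workaround, assuming NODE_HEADER only groups the per-node bookkeeping fields
(the field names below are illustrative, not taken from the patch):

    /* Name the header type and embed it, so older compilers see a proper
     * member before the flexible array; accesses become node->hdr.<field>.
     */
    struct smalloc_node_header {
            struct list_head list;
            unsigned long available_words;
    };

    struct smalloc_node {
            struct smalloc_node_header hdr;
            __SMALLOC_ALIGNED__ align_t data[];
    };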

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [kernel-hardening] Re: [PATCH 1/1] Sealable memory support

2017-05-29 Thread Boris Lukashev
If I understand the current direction for smalloc, it's to implement it
without the ability to "unseal," which has implications on how LSM
implementations and other users of these dynamic allocations handle
things. If it's implemented without a writeable interface for modules
which need it, then switched to a CoW (or other
controlled/isolated/hardened) mechanism for creating writeable
segments in dynamic allocations, that would create another set of
breaking changes for the consumers adapting to the first set. In the
spirit of adopting things faster/smoother, maybe it makes sense to
permit the smalloc re-writing method suggested in the first
implementation; with an API which consumers can later use for a more
secure approach that reduces the attack surface without breaking
consumer interfaces (or internal semantics). A sysctl affecting the
behavior as ro-only or ro-and-sometimes-rw would give users the
flexibility to tune their environment per their needs, which should
also reduce potential conflict/complaints.
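
A minimal sketch of that sysctl idea: a single knob selecting whether sealed
pools may ever be made writable again at run time. All of the names here
(smalloc_allow_unseal, the "vm" path, the initcall) are illustrative
assumptions, not part of the posted patch:

    #include <linux/init.h>
    #include <linux/sysctl.h>

    static int smalloc_allow_unseal;        /* 0 = sealing is one-way (default) */
    static int zero;
    static int one = 1;

    static struct ctl_table smalloc_table[] = {
            {
                    .procname     = "smalloc_allow_unseal",
                    .data         = &smalloc_allow_unseal,
                    .maxlen       = sizeof(int),
                    .mode         = 0644,
                    .proc_handler = proc_dointvec_minmax,
                    .extra1       = &zero,
                    .extra2       = &one,
            },
            { }
    };

    static int __init smalloc_sysctl_init(void)
    {
            return register_sysctl("vm", smalloc_table) ? 0 : -ENOMEM;
    }
    late_initcall(smalloc_sysctl_init);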

On Sun, May 28, 2017 at 5:32 PM, Kees Cook  wrote:
> On Sun, May 28, 2017 at 11:56 AM, Boris Lukashev
>  wrote:
>> So what about a middle ground where CoW semantics are used to enforce
>> the state of these allocations as RO, but provide a strictly
>> controlled pathway to read the RO data, copy and modify it, then write
>> and seal into a new allocation. Successful return from this process
>> should permit the page table to change the pointer to where the object
>> now resides, and initiate freeing of the original memory so long as a
>> refcount is kept for accesses. That way, sealable memory is sealed,
>> and any consumers reading it will be using the original ptr to the
>> original smalloc region. Attackers who do manage to change the
>
> This could be another way to do it, yeah, and it helps that smalloc()
> is built on vmalloc(). It'd require some careful design, but it could
> be a way forward after this initial sealed-after-init version goes in.
>
>> Lastly, my meager understanding is that PAX set the entire kernel as
>> RO, and implemented writeable access via pax_open/close. How were they
>> fighting against race conditions, and what is the benefit of specific
>> regions being allocated this way as opposed to the RO-all-the-things
>> approach which makes writes a specialized set of operations?
>
> My understanding is that PaX's KERNEXEC with the constification plugin
> moves a substantial portion of the kernel's .data section
> (effectively) into the .rodata section. It's not the "entire" kernel.
> (Well, depending on how you count. The .text section is already
> read-only upstream.) PaX, as far as I know, provided no dynamic memory
> allocation protections, like smalloc() would provide.
>
> -Kees
>
> --
> Kees Cook
> Pixel Security



-- 
Boris Lukashev
Systems Architect
Semper Victus


Re: [kernel-hardening] Re: [PATCH 1/1] Sealable memory support

2017-05-28 Thread Kees Cook
On Sun, May 28, 2017 at 11:56 AM, Boris Lukashev
 wrote:
> So what about a middle ground where CoW semantics are used to enforce
> the state of these allocations as RO, but provide a strictly
> controlled pathway to read the RO data, copy and modify it, then write
> and seal into a new allocation. Successful return from this process
> should permit the page table to change the pointer to where the object
> now resides, and initiate freeing of the original memory so long as a
> refcount is kept for accesses. That way, sealable memory is sealed,
> and any consumers reading it will be using the original ptr to the
> original smalloc region. Attackers who do manage to change the

This could be another way to do it, yeah, and it helps that smalloc()
is built on vmalloc(). It'd require some careful design, but it could
be a way forward after this initial sealed-after-init version goes in.

> Lastly, my meager understanding is that PAX set the entire kernel as
> RO, and implemented writeable access via pax_open/close. How were they
> fighting against race conditions, and what is the benefit of specific
> regions being allocated this way as opposed to the RO-all-the-things
> approach which makes writes a specialized set of operations?

My understanding is that PaX's KERNEXEC with the constification plugin
moves a substantial portion of the kernel's .data section
(effectively) into the .rodata section. It's not the "entire" kernel.
(Well, depending on how you count. The .text section is already
read-only upstream.) PaX, as far as I know, provided no dynamic memory
allocation protections, like smalloc() would provide.

-Kees

-- 
Kees Cook
Pixel Security


Re: [kernel-hardening] Re: [PATCH 1/1] Sealable memory support

2017-05-28 Thread Boris Lukashev
One-time sealable memory makes the most sense from a defensive
perspective - red team reads this stuff, the races mentioned will be
implemented as described to win the day, and probably in other
innovative ways. If a gap is left in the implementation, without
explicit coverage by an adjacent function, it will be used no matter
how small the chances of it occurring in the real world are - grooming
systems to create unlikely conditions is fair play (look at
eternalblue's SMB pool machinations).
However, out-of-tree modules will likely not appreciate this -
third-party LSMs, tpe module, SCST, etc. I don't want to get into the
"NIH" debate; they're real functional components, used by real companies,
often enough that they need to be considered a member of the
ecosystem, even if not a first-order member.
So what about a middle ground where CoW semantics are used to enforce
the state of these allocations as RO, but provide a strictly
controlled pathway to read the RO data, copy and modify it, then write
and seal into a new allocation. Successful return from this process
should permit the page table to change the pointer to where the object
now resides, and initiate freeing of the original memory so long as a
refcount is kept for accesses. That way, sealable memory is sealed,
and any consumers reading it will be using the original ptr to the
original smalloc region. Attackers who do manage to change the
allocation by writing a new one still have to figure out how to change
the region used by an existing consumer. New consumers will get hit by
whatever they changed, but the change will be tracked and require them
to gain execution control over the allocator itself in order to effect
the change (or ROP-chain something else into doing it, but that's a
discussion on RAP/CFI).
CPU-local ops are great, if they don't halt the other cores. Stopping
all other CPUs is going to be DoA in HPC and other CPU-intensive
workloads - think what ZFS would do if its pipelines kept getting
halted by something running a lot of smallocs (they get
non-preemptible often enough, requiring waits on both sides of the
op), or how LIO would behave - iSCSI waits for no man or woman - or
their allocator strategies. I'm all for "security should be more of a
concern than performance - it's easier to build a faster car later if
you're still alive to do it," but keeping in mind who the consumers
are, I can easily see this functionality staying disabled in most
distributions and thus receiving much less testing and beatdown than
it should.
Lastly, my meager understanding is that PAX set the entire kernel as
RO, and implemented writeable access via pax_open/close. How were they
fighting against race conditions, and what is the benefit of specific
regions being allocated this way as opposed to the RO-all-the-things
approach which makes writes a specialized set of operations?
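
As a purely illustrative rendering of the copy/modify/reseal flow described
above - no such API exists in the posted patch, and struct foo_cfg plus the
smalloc()/smalloc_seal() signatures are assumptions based on the thread:

    /* Build the patched copy in a fresh, still-writable allocation, seal it,
     * then let the caller republish the pointer; the live object is never
     * made writable in place.
     */
    struct foo_cfg { unsigned long opts; };         /* placeholder payload */

    struct foo_cfg *foo_update(struct smalloc_pool *pool,
                               const struct foo_cfg *old,
                               void (*edit)(struct foo_cfg *))
    {
            struct foo_cfg *new = smalloc(pool, sizeof(*new));

            if (!new)
                    return NULL;
            *new = *old;            /* copy the currently sealed contents */
            edit(new);              /* apply the change to the private copy */
            smalloc_seal(pool);     /* back to read-only before publishing */
            return new;             /* caller swaps the live pointer (e.g. via
                                     * rcu_assign_pointer) and arranges to free
                                     * the old copy once readers are done */
    }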

On Sun, May 28, 2017 at 2:23 PM, Kees Cook  wrote:
> On Wed, May 24, 2017 at 10:45 AM, Igor Stoppa  wrote:
>> On 23/05/17 23:11, Kees Cook wrote:
>>> On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa  wrote:
>>> I meant this:
>>>
>>> CPU 1 CPU 2
>>> create
>>> alloc
>>> write
>>> seal
>>> ...
>>> unseal
>>> write
>>> write
>>> seal
>>>
>>> The CPU 2 write would be, for example, an attacker using a
>>> vulnerability to attempt to write to memory in the sealed area. All it
>>> would need to do to succeed would be to trigger an action in the
>>> kernel that would do a "legitimate" write (which requires the unseal),
>>> and race it. Unsealing should be CPU-local, if the API is going to
>>> support this kind of access.
>>
>> I see.
>> If the CPU1 were to forcibly halt anything that can race with it, then
>> it would be sure that there was no interference.
>
> Correct. This is actually what ARM does for doing kernel memory
> writing when poking stuff for kprobes, etc. It's rather dramatic,
> though. :)
>
>> A reactive approach could be, instead, to re-validate the content after
>> the sealing, assuming that it is possible.
>
> I would prefer to avoid this, as that allows an attacker to still have
> made the changes (which could even result in them then disabling the
> re-validation during the attack).
>
>>> I am more concerned about _any_ unseal after initial seal. And even
>>> then, it'd be nice to keep things CPU-local. My concerns are related
>>> to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704)
>>> which is kind of like this, but focused on the .data section, not
>>> dynamic memory. It has similar concerns about CPU-locality.
>>> Additionally, even writing to memory and then making it read-only
>>> later runs risks (see threads about BPF JIT races vs making things
>>> read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK
>>> doesn't change the risk this series is fixing: races with attacker
>>> writes during assignment but before read-only marking).
>>
>> If you are talking about an attacker, rather than protection against
>> accidental 

Re: [PATCH 1/1] Sealable memory support

2017-05-28 Thread Kees Cook
On Wed, May 24, 2017 at 10:45 AM, Igor Stoppa  wrote:
> On 23/05/17 23:11, Kees Cook wrote:
>> On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa  wrote:
>> I meant this:
>>
>> CPU 1 CPU 2
>> create
>> alloc
>> write
>> seal
>> ...
>> unseal
>> write
>> write
>> seal
>>
>> The CPU 2 write would be, for example, an attacker using a
>> vulnerability to attempt to write to memory in the sealed area. All it
>> would need to do to succeed would be to trigger an action in the
>> kernel that would do a "legitimate" write (which requires the unseal),
>> and race it. Unsealing should be CPU-local, if the API is going to
>> support this kind of access.
>
> I see.
> If the CPU1 were to forcibly halt anything that can race with it, then
> it would be sure that there was no interference.

Correct. This is actually what ARM does for doing kernel memory
writing when poking stuff for kprobes, etc. It's rather dramatic,
though. :)

> A reactive approach could be, instead, to re-validate the content after
> the sealing, assuming that it is possible.

I would prefer to avoid this, as that allows an attacker to still have
made the changes (which could even result in them then disabling the
re-validation during the attack).

>> I am more concerned about _any_ unseal after initial seal. And even
>> then, it'd be nice to keep things CPU-local. My concerns are related
>> to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704)
>> which is kind of like this, but focused on the .data section, not
>> dynamic memory. It has similar concerns about CPU-locality.
>> Additionally, even writing to memory and then making it read-only
>> later runs risks (see threads about BPF JIT races vs making things
>> read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK
>> doesn't change the risk this series is fixing: races with attacker
>> writes during assignment but before read-only marking).
>
> If you are talking about an attacker, rather than protection against
> accidental overwrites, how can hashing be enough?
> Couldn't the attacker compromise that too?

In theory, yes, though the goal was to dedicate a register to the
hash, which would make it hard/impossible for an attacker to reach.
(The BPF JIT situation is just an example of this kind of race,
though. I'm still in favor of reducing the write window to init-time
from full run-time.)

>> So, while smalloc would hugely reduce the window an attacker has
>> available to change data contents, this API doesn't eliminate it. (To
>> eliminate it, there would need to be a CPU-local page permission view
>> that let only the current CPU to the page, and then restore it to
>> read-only to match the global read-only view.)
>
> That or, if one is ready to take the hit, freeze every other possible
> attack vector. But I'm not sure this could be justifiable.

I would expect other people would NAK using the "stop all other CPUs"
approach. Better to have the CPU-local writes.

>> Ah! In that case, sure. This isn't what the proposed API provided,
>> though, so let's adjust it to only perform the unseal at destroy time.
>> That makes it much saner, IMO. "Write once" dynamic allocations, or
>> "read-only after seal". woalloc? :P yay naming
>
> For now I'm still using smalloc.
> Anything that is either [x]malloc or [yz]malloc is fine, lengthwise.
> Other options might require some re-formatting.

Yeah, I don't have any better idea for names. :)

>> Ah, okay. Most of the LSM is happily covered by __ro_after_init. If we
>> could just drop the runtime disabling of SELinux, we'd be fine.
>
> I am not sure I understand this point.
> If the kernel is properly configured, the master toggle variable
> disappears, right?
> Or do you mean the disabling through modifications of the linked list of
> the hooks?

We might be talking past each-other. Right now, the LSM is marked with
__ro_after_init, which will make all the list structures, entries, etc
read-only already. There is one case where this is not true, and
that's for CONFIG_SECURITY_WRITABLE_HOOKS for
CONFIG_SECURITY_SELINUX_DISABLE, which wants to do run-time removal of
SELinux. Are you talking about the SELinux policy installed during
early boot? Which things did you want to use smalloc() on?

>> It seems like smalloc pools could also be refcounted?
>
> I am not sure what you mean.
> What do you want to count?
> Number of pools? Nodes per pool? Allocations per node?

I meant things that point into an smalloc() pool could keep a refcount
and when nothing was pointing into it, it could be destroyed. (i.e.
using refcount garbage collection instead of explicit destruction.)
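
A minimal sketch of that refcount idea, assuming the pool structure from the
patch grew a kref; smalloc_destroy() and the get/put helper names are
illustrative, not taken from the posted code:

    #include <linux/kref.h>
    #include <linux/list.h>
    #include <linux/mutex.h>

    struct smalloc_pool {
            struct list_head list;          /* nodes */
            struct mutex lock;
            enum seal_t seal;
            struct kref users;              /* one ref per pointer kept into the pool */
    };

    static void smalloc_pool_release(struct kref *kref)
    {
            struct smalloc_pool *pool = container_of(kref, struct smalloc_pool,
                                                     users);

            smalloc_destroy(pool);          /* unseal and free every node */
    }

    /* Take a reference whenever a long-lived pointer into the pool is stored. */
    static inline void smalloc_pool_get(struct smalloc_pool *pool)
    {
            kref_get(&pool->users);
    }

    /* Drop it when that pointer goes away; the last put tears the pool down. */
    static inline void smalloc_pool_put(struct smalloc_pool *pool)
    {
            kref_put(&pool->users, smalloc_pool_release);
    }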

> And what for?

It might be easier to reason about later if allocations get complex.
It's certainly not required for the first version of this.

> At least in the case of tearing down a pool, when a module is unloaded,
> nobody needs to free anything that was allocated with smalloc.
> The teardown function will free the pages from each node.

Right, yeah.


Re: [PATCH 1/1] Sealable memory support

2017-05-24 Thread Igor Stoppa


On 23/05/17 23:11, Kees Cook wrote:
> On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa  wrote:

[...]

> I would want hardened usercopy support as a requirement for using
> smalloc(). Without it, we're regressing the over-read protection that
> already exists for slab objects, if kernel code switched from slab to
> smalloc. It should be very similar to the existing slab checks. "Is
> this a smalloc object? Have we read beyond the end of a given object?"
> etc. The metadata is all there, except for an efficient way to mark a
> page as a smalloc page, but I think that just requires a new Page$Name
> bit test, as done for slab.

ok

[...]

> I meant this:
> 
> CPU 1 CPU 2
> create
> alloc
> write
> seal
> ...
> unseal
> write
> write
> seal
> 
> The CPU 2 write would be, for example, an attacker using a
> vulnerability to attempt to write to memory in the sealed area. All it
> would need to do to succeed would be to trigger an action in the
> kernel that would do a "legitimate" write (which requires the unseal),
> and race it. Unsealing should be CPU-local, if the API is going to
> support this kind of access.

I see.
If the CPU1 were to forcibly halt anything that can race with it, then
it would be sure that there was no interference.
A reactive approach could be, instead, to re-validate the content after
the sealing, assuming that it is possible.

[...]

> I am more concerned about _any_ unseal after initial seal. And even
> then, it'd be nice to keep things CPU-local. My concerns are related
> to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704)
> which is kind of like this, but focused on the .data section, not
> dynamic memory. It has similar concerns about CPU-locality.
> Additionally, even writing to memory and then making it read-only
> later runs risks (see threads about BPF JIT races vs making things
> read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK
> doesn't change the risk this series is fixing: races with attacker
> writes during assignment but before read-only marking).

If you are talking about an attacker, rather than protection against
accidental overwrites, how can hashing be enough?
Couldn't the attacker compromise that too?

> So, while smalloc would hugely reduce the window an attacker has
> available to change data contents, this API doesn't eliminate it. (To
> eliminate it, there would need to be a CPU-local page permission view
> that let only the current CPU to the page, and then restore it to
> read-only to match the global read-only view.)

That or, if one is ready to take the hit, freeze every other possible
attack vector. But I'm not sure this could be justifiable.

[...]

> Ah! In that case, sure. This isn't what the proposed API provided,
> though, so let's adjust it to only perform the unseal at destroy time.
> That makes it much saner, IMO. "Write once" dynamic allocations, or
> "read-only after seal". woalloc? :P yay naming

For now I'm still using smalloc.
Anything that is either [x]malloc or [yz]malloc is fine, lengthwise.
Other options might require some re-formatting.

[...]

> I think a shared global pool would need to be eliminated for true
> write-once semantics.

ok

[...]

>> I'd rather not add extra locking to something that doesn't need it:
>> Allocate - write - seal - read, read, read, ... - unseal - destroy.
> 
> Yup, I would be totally fine with this. It still has a race between
> allocate and seal, but it's a huge improvement over the existing state
> of the world where all dynamic memory is writable. :)

Great!
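
A minimal usage sketch of that agreed life cycle (allocate - write - seal -
read, read, read ... - destroy, with the unseal happening only as part of
destruction). The smalloc_create()/smalloc()/smalloc_seal()/smalloc_destroy()
names and signatures are assumptions based on the thread, and
my_policy/setup_policy() are placeholders:

    #include <linux/errno.h>
    #include <linux/init.h>

    struct my_policy { unsigned long flags; };       /* placeholder payload */

    static struct smalloc_pool *pool;
    static struct my_policy *policy;

    static void setup_policy(struct my_policy *p)
    {
            p->flags = 0;                            /* the one-time write */
    }

    static int __init my_init(void)
    {
            pool = smalloc_create();                 /* new, writable pool */
            if (!pool)
                    return -ENOMEM;

            policy = smalloc(pool, sizeof(*policy)); /* allocate while writable */
            if (!policy) {
                    smalloc_destroy(pool);
                    return -ENOMEM;
            }
            setup_policy(policy);                    /* write only at init time */

            smalloc_seal(pool);                      /* read-only from here on */
            return 0;
    }

    static void __exit my_exit(void)
    {
            smalloc_destroy(pool);                   /* implicit unseal + free */
    }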


[...]

> Ah, okay. Most of the LSM is happily covered by __ro_after_init. If we
> could just drop the runtime disabling of SELinux, we'd be fine.

I am not sure I understand this point.
If the kernel is properly configured, the master toggle variable
disappears, right?
Or do you mean the disabling through modifications of the linked list of
the hooks?

[...]


> Hm, I just meant add a char[] to the metadata and pass it in during
> create(). Then it's possible to report which smalloc cache is being
> examined during hardened usercopy checks.

Ok, that is not a big deal.
wrt this, I have spent some time writing a debug module, which currently
dumps into a debugfs entry a bunch of info about the various pools.
I could split it across multiple entries, using the label to generate
their names.

[...]

> It seems like smalloc pools could also be refcounted?

I am not sure what you mean.
What do you want to count?
Number of pools? Nodes per pool? Allocations per node?

And what for?

At least in the case of tearing down a pool, when a module is unloaded,
nobody needs to free anything that was allocated with smalloc.
The teardown function will free the pages from each node.

Is this the place where you think there should be a check on the number
of pages freed?

[...]

 +#define NODE_HEADER\
 +   struct {\
 + 

Re: [PATCH 1/1] Sealable memory support

2017-05-23 Thread Kees Cook
On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa  wrote:
> On 23/05/17 00:38, Kees Cook wrote:
>> On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa  wrote:
>
> [...]
>
>> For the first bit of bikeshedding, should this really be called
>> seal/unseal? My mind is probably just broken from having read TPM
>> documentation, but this isn't really "sealing" as I'd understand it
>> (it's not tied to a credential, for example). It's "only" rw/ro.
>> Perhaps "protect/unprotect" or just simply "readonly/writable", and
>> call the base function "romalloc"?
>
> I was not aware of the specific meaning of "seal" in this context.
> The term was implicitly proposed by Michal Hocko, while discussing
> the mechanism, and I liked it more than what I was using initially:
> "lockable".
>
> tbh I like the sound of "smalloc" better than "romalloc"
>
> But this is really the least of my worries :-P
>
>> This is fundamentally a heap allocator, with linked lists, etc. I'd
>> like to see as much attention as possible given to hardening it
>> against attacks, especially adding redzoning around the metadata at
>> least, and perhaps requiring that CONFIG_DEBUG_LIST be enabled.
>
> My initial goal was to provide something that is useful without
> affecting performance.
>
> You seem to be pushing for a more extreme approach.

I don't know about pushing, but it seemed like the API provided for
arbitrary unsealing, etc. More below...

> While I have nothing against it and I actually agree that it can be
> useful, I would not make it mandatory.
>
> More on this later.
>
>> And as
>> part of that, I'd like hardened usercopy to grow knowledge of these
>> allocations so we can bounds-check objects. Right now, mm/usercopy.c
>> just looks at PageSlab(page) to decide if it should do slab checks. I
>> think adding a check for this type of object would be very important
>> there.
>
> I am not familiar with this and I need to study it, however I still
> think that if there is a significant trade-off in terms of performance
> vs resilience, it should be optional, for those who want it.

I would want hardened usercopy support as a requirement for using
smalloc(). Without it, we're regressing the over-read protection that
already exists for slab objects, if kernel code switched from slab to
smalloc. It should be very similar to the existing slab checks. "Is
this a smalloc object? Have we read beyond the end of a given object?"
etc. The metadata is all there, except for an efficient way to mark a
page as a smalloc page, but I think that just requires a new Page$Name
bit test, as done for slab.

> Maybe there could be a master toggle for these options, if it makes
> sense to group them logically. How does this sound?
>
>> The ro/rw granularity here is the _entire_ pool, not a specific
>> allocation (or page containing the allocation). I'm concerned that
>> makes this very open to race conditions where, especially in the
>> global pool, one thread can be trying to write to ro data in a pool
>> and another has made the pool writable.
>
> I have the impression we are thinking to something different.
> Close, but different enough.
>
> First of all, how using a mutex can create races?
> Do you mean with respect of other resources that might be held by
> competing users of the pool?

I meant this:

CPU 1 CPU 2
create
alloc
write
seal
...
unseal
write
write
seal

The CPU 2 write would be, for example, an attacker using a
vulnerability to attempt to write to memory in the sealed area. All it
would need to do to succeed would be to trigger an action in the
kernel that would do a "legitimate" write (which requires the unseal),
and race it. Unsealing should be CPU-local, if the API is going to
support this kind of access.

> That, imho, is a locking problem that cannot be solved here.
> You can try to mitigate it, by reducing the chances it will happen, but
> basically you are trying to make do with an user of the API that is not
> implementing locking correctly.
> I'd say that it's better to fix the user.
>
> If you meant something else, I really didn't get it :-)
>
> More about the frequency of the access: you seem to expect very often
> seal/unseal - or lock/unlock, while I don't.

I am more concerned about _any_ unseal after initial seal. And even
then, it'd be nice to keep things CPU-local. My concerns are related
to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704)
which is kind of like this, but focused on the .data section, not
dynamic memory. It has similar concerns about CPU-locality.
Additionally, even writing to memory and then making it read-only
later runs risks (see threads about BPF JIT races vs making things
read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK
doesn't change the risk this series is fixing: races with attacker
writes during assignment but before read-only marking).

So, while smalloc would hugely reduce the window an attacker has
available to 

Re: [PATCH 1/1] Sealable memory support

2017-05-23 Thread Kees Cook
On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa  wrote:
> On 23/05/17 00:38, Kees Cook wrote:
>> On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa  wrote:
>
> [...]
>
>> For the first bit of bikeshedding, should this really be called
>> seal/unseal? My mind is probably just broken from having read TPM
>> documentation, but this isn't really "sealing" as I'd understand it
>> (it's not tied to a credential, for example). It's "only" rw/ro.
>> Perhaps "protect/unprotect" or just simply "readonly/writable", and
>> call the base function "romalloc"?
>
> I was not aware of the specific meaning of "seal", in this context.
> The term was implicitly proposed by Michal Hocko, while discussing the
> mechanism, and I liked it more than what I was using initially:
> "lockable".
>
> tbh I like the sound of "smalloc" better than "romalloc"
>
> But this is really the least of my worries :-P
>
>> This is fundamentally a heap allocator, with linked lists, etc. I'd
>> like to see as much attention as possible given to hardening it
>> against attacks, especially adding redzoning around the metadata at
>> least, and perhaps requiring that CONFIG_DEBUG_LIST be enabled.
>
> My initial goal was to provide something that is useful without
> affecting performance.
>
> You seem to be pushing for a more extreme approach.

I don't know about pushing, but it seemed like the API provided for
arbitrary unsealing, etc. More below...

> While I have nothing against it and I actually agree that it can be
> useful, I would not make it mandatory.
>
> More on this later.
>
>> And as
>> part of that, I'd like hardened usercopy to grow knowledge of these
>> allocations so we can bounds-check objects. Right now, mm/usercopy.c
>> just looks at PageSlab(page) to decide if it should do slab checks. I
>> think adding a check for this type of object would be very important
>> there.
>
> I am not familiar with this and I need to study it. However, I still
> think that if there is a significant trade-off in terms of performance
> vs resilience, it should be optional, for those who want it.

I would want hardened usercopy support as a requirement for using
smalloc(). Without it, we're regressing the over-read protection that
already exists for slab objects, if kernel code switched from slab to
smalloc. It should be very similar to the existing slab checks. "Is
this a smalloc object? Have we read beyond the end of a given object?"
etc. The metadata is all there, except for an efficient way to mark a
page as a smalloc page, but I think that just requires a new Page$Name
bit test, as done for slab.
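
Roughly, I would expect the check to look something like the sketch
below, modelled on the existing slab path in mm/usercopy.c;
PageSmalloc() and find_smalloc_object() are made-up names, just to
illustrate the shape, not something the posted patch provides:

#include <linux/mm.h>

/* Sketch: return a report string if the copy is suspicious, NULL if the
 * page is not a smalloc page or the copy stays inside one object.
 */
static const char *check_smalloc_object(const void *ptr, unsigned long n,
					struct page *page)
{
	unsigned long addr = (unsigned long)ptr;
	unsigned long obj_start, obj_len;

	if (!PageSmalloc(page))		/* hypothetical page bit, like PageSlab */
		return NULL;

	if (!find_smalloc_object(ptr, &obj_start, &obj_len))
		return "<smalloc metadata>";	/* pointer into bookkeeping */

	if (addr < obj_start || addr - obj_start + n > obj_len)
		return "<smalloc object overflow>";

	return NULL;
}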

> Maybe there could be a master toggle for these options, if it makes
> sense to group them logically. How does this sound?
>
>> The ro/rw granularity here is the _entire_ pool, not a specific
>> allocation (or page containing the allocation). I'm concerned that
>> makes this very open to race conditions where, especially in the
>> global pool, one thread can be trying to write to ro data in a pool
>> and another has made the pool writable.
>
> I have the impression we are thinking of something different.
> Close, but different enough.
>
> First of all, how can using a mutex create races?
> Do you mean with respect to other resources that might be held by
> competing users of the pool?

I meant this:

CPU 1                  CPU 2
create
alloc
write
seal
...
unseal
                       write
write
seal

The CPU 2 write would be, for example, an attacker using a
vulnerability to attempt to write to memory in the sealed area. All it
would need to do to succeed would be to trigger an action in the
kernel that would do a "legitimate" write (which requires the unseal),
and race it. Unsealing should be CPU-local, if the API is going to
support this kind of access.

> That, imho, is a locking problem that cannot be solved here.
> You can try to mitigate it by reducing the chances it will happen, but
> basically you are trying to make do with a user of the API that is not
> implementing locking correctly.
> I'd say that it's better to fix the user.
>
> If you meant something else, I really didn't get it :-)
>
> More about the frequency of access: you seem to expect very frequent
> seal/unseal - or lock/unlock - while I don't.

I am more concerned about _any_ unseal after initial seal. And even
then, it'd be nice to keep things CPU-local. My concerns are related
to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704)
which is kind of like this, but focused on the .data section, not
dynamic memory. It has similar concerns about CPU-locality.
Additionally, even writing to memory and then making it read-only
later runs risks (see threads about BPF JIT races vs making things
read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK
doesn't change the risk this series is fixing: races with attacker
writes during assignment but before read-only marking).
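
For the dynamic case, one way to avoid flipping the whole pool to RW
would be to map a short-lived writable alias of just the affected page,
patch it, and drop the alias. A sketch; smalloc_write() is a made-up
name, and the single-page limit is only to keep the example short:

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <linux/vmalloc.h>

static int smalloc_write(void *dst, const void *src, size_t len)
{
	struct page *page = vmalloc_to_page(dst);
	unsigned long offset = (unsigned long)dst & ~PAGE_MASK;
	void *alias;

	if (!page || offset + len > PAGE_SIZE)
		return -EINVAL;

	/* Short-lived writable alias at a different address. */
	alias = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
	if (!alias)
		return -ENOMEM;

	memcpy(alias + offset, src, len);
	vunmap(alias);
	return 0;
}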

So, while smalloc would hugely reduce the window an attacker has
available to change data contents, this API doesn't 

Re: [PATCH 1/1] Sealable memory support

2017-05-23 Thread Igor Stoppa
On 23/05/17 00:38, Kees Cook wrote:
> On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa  wrote:

[...]

> For the first bit of bikeshedding, should this really be called
> seal/unseal? My mind is probably just broken from having read TPM
> documentation, but this isn't really "sealing" as I'd understand it
> (it's not tied to a credential, for example). It's "only" rw/ro.
> Perhaps "protect/unprotect" or just simply "readonly/writable", and
> call the base function "romalloc"?

I was not aware of the specific meaning of "seal", in this context.
The term was implicitly proposed by Michal Hocko, while discussing the
mechanism, and I liked it more than what I was using initially:
"lockable".

tbh I like the sound of "smalloc" better than "romalloc"

But this is really the least of my worries :-P

> This is fundamentally a heap allocator, with linked lists, etc. I'd
> like to see as much attention as possible given to hardening it
> against attacks, especially adding redzoning around the metadata at
> least, and perhaps requiring that CONFIG_DEBUG_LIST be enabled.

My initial goal was to provide something that is useful without
affecting performance.

You seem to be pushing for a more extreme approach.
While I have nothing against it and I actually agree that it can be
useful, I would not make it mandatory.

More on this later.

> And as
> part of that, I'd like hardened usercopy to grow knowledge of these
> allocations so we can bounds-check objects. Right now, mm/usercopy.c
> just looks at PageSlab(page) to decide if it should do slab checks. I
> think adding a check for this type of object would be very important
> there.

I am not familiar with this and I need to study it. However, I still
think that if there is a significant trade-off in terms of performance
vs resilience, it should be optional, for those who want it.

Maybe there could be a master toggle for these options, if it makes
sense to group them logically. How does this sound?

> The ro/rw granularity here is the _entire_ pool, not a specific
> allocation (or page containing the allocation). I'm concerned that
> makes this very open to race conditions where, especially in the
> global pool, one thread can be trying to write to ro data in a pool
> and another has made the pool writable.

I have the impression we are thinking of something different.
Close, but different enough.

First of all, how can using a mutex create races?
Do you mean with respect to other resources that might be held by
competing users of the pool?

That, imho, is a locking problem that cannot be solved here.
You can try to mitigate it by reducing the chances it will happen, but
basically you are trying to make do with a user of the API that is not
implementing locking correctly.
I'd say that it's better to fix the user.

If you meant something else, I really didn't get it :-)

More about the frequency of access: you seem to expect very frequent
seal/unseal - or lock/unlock - while I don't.

What I envision as the primary case for unlocking is the teardown of
the entire pool, for example preceding the unloading of the module.

Think about __ro_after_init : once it is r/o, it stays r/o.

Which also means that there could possibly be a busy transient, while
allocations are made. That is true.
But only for shared pools. If a module uses its own pool, I would expect
the initialization of the module to be mostly sequential.
So, no chance of races. Do you agree?

Furthermore, if a module _really_ wants to do parallel allocation, then
maybe it's simpler and cleaner to have one pool per "thread", or
whatever mechanism is used.
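
To illustrate the sequential pattern I have in mind: only smalloc()
itself comes from the patch, while the create/seal/destroy names below
are placeholders for whatever the final API will be:

#include <linux/init.h>
#include <linux/module.h>

struct my_config {
	int value;
};

static struct smalloc_pool *my_pool;
static struct my_config *cfg;

static int __init my_module_init(void)
{
	my_pool = smalloc_create_pool();	/* placeholder name */
	if (!my_pool)
		return -ENOMEM;

	cfg = smalloc(sizeof(*cfg), my_pool);
	if (!cfg) {
		smalloc_destroy_pool(my_pool);	/* placeholder name */
		return -ENOMEM;
	}
	cfg->value = 42;		/* write while the pool is still unsealed */

	smalloc_seal(my_pool);		/* placeholder: pool becomes read-only */
	return 0;
}

static void __exit my_module_exit(void)
{
	smalloc_destroy_pool(my_pool);	/* placeholder: frees all nodes */
}

module_init(my_module_init);
module_exit(my_module_exit);

All allocations happen during init, on a single thread, so no seal/unseal
races are possible in this pattern.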


The global pool is mostly there to complete the offering of
__ro_after_init. If one wants to use it, it's available.
It saves the trouble of having one's own pool, but it means competing
with the rest of the kernel for locking.

If it seems a bad idea, it can be easily removed.

[...]

>> +#include "smalloc.h"
> 
> Shouldn't this just be  ?

yes, I have fixed it in the next revision

[...]

>> +   /* If no pool specified, use the global one. */
>> +   if (!pool)
>> +   pool = global_pool;
> 
> It should be impossible, but this should check for global_pool == NULL too, 
> IMO.

I'm curious: why do you think it could happen?
I'd rather throw an exception, if I cannot initialize global_pool.

[...]

>> +   if (pool->seal == seal) {
>> +   mutex_unlock(&pool->lock);
>> +   return;
> 
> I actually think this should be a BUG condition, since this means a
> mismatched seal/unseal happened. The pool should never be left
> writable at rest, a user should create a pool, write to it, seal. 

On this I agree.

> Any
> updates should unseal, write, seal. To attempt an unseal and find it
> already unsealed seems bad.

But what you are describing seems exactly the case where there could be
a race.
EX: 2 writers try to unseal and add/change something.
They could have a collision both when trying to seal and unseal.

One could argue that there 

Re: [PATCH 1/1] Sealable memory support

2017-05-22 Thread Kees Cook
On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa  wrote:
> Dynamically allocated variables can be made read only,
> after they have been initialized, provided that they reside in memory
> pages devoid of any RW data.
>
> The implementation supplies means to create independent pools of memory,
> which can be individually created, sealed/unsealed and destroyed.

This would be a welcome addition, thanks for posting it! I have a
bunch of feedback, here and below:

For the first bit of bikeshedding, should this really be called
seal/unseal? My mind is probably just broken from having read TPM
documentation, but this isn't really "sealing" as I'd understand it
(it's not tied to a credential, for example). It's "only" rw/ro.
Perhaps "protect/unprotect" or just simply "readonly/writable", and
call the base function "romalloc"?

This is fundamentally a heap allocator, with linked lists, etc. I'd
like to see as much attention as possible given to hardening it
against attacks, especially adding redzoning around the metadata at
least, and perhaps requiring that CONFIG_DEBUG_LIST be enabled. And as
part of that, I'd like hardened usercopy to grow knowledge of these
allocations so we can bounds-check objects. Right now, mm/usercopy.c
just looks at PageSlab(page) to decide if it should do slab checks. I
think adding a check for this type of object would be very important
there.

The ro/rw granularity here is the _entire_ pool, not a specific
allocation (or page containing the allocation). I'm concerned that
makes this very open to race conditions where, especially in the
global pool, one thread can be trying to write to ro data in a pool
and another has made the pool writable.

> A global pool is made available for those kernel modules that do not
> need to manage an independent pool.
>
> Signed-off-by: Igor Stoppa 
> ---
>  mm/Makefile  |   2 +-
>  mm/smalloc.c | 200 +++
>  mm/smalloc.h |  61 ++
>  3 files changed, 262 insertions(+), 1 deletion(-)
>  create mode 100644 mm/smalloc.c
>  create mode 100644 mm/smalloc.h
>
> diff --git a/mm/Makefile b/mm/Makefile
> index 026f6a8..737c42a 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -39,7 +39,7 @@ obj-y := filemap.o mempool.o oom_kill.o \
>mm_init.o mmu_context.o percpu.o slab_common.o \
>compaction.o vmacache.o swap_slots.o \
>interval_tree.o list_lru.o workingset.o \
> -  debug.o $(mmu-y)
> +  debug.o smalloc.o $(mmu-y)
>
>  obj-y += init-mm.o
>
> diff --git a/mm/smalloc.c b/mm/smalloc.c
> new file mode 100644
> index 000..fa04cc5
> --- /dev/null
> +++ b/mm/smalloc.c
> @@ -0,0 +1,200 @@
> +/*
> + * smalloc.c: Sealable Memory Allocator
> + *
> + * (C) Copyright 2017 Huawei Technologies Co. Ltd.
> + * Author: Igor Stoppa 
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; version 2
> + * of the License.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +
> +#include 
> +#include 
> +#include "smalloc.h"

Shouldn't this just be  ?

> +#define page_roundup(size) (((size) + !(size) - 1 + PAGE_SIZE) & PAGE_MASK)
> +
> +#define pages_nr(size) (page_roundup(size) / PAGE_SIZE)
> +
> +static struct smalloc_pool *global_pool;
> +
> +struct smalloc_node *__smalloc_create_node(unsigned long words)
> +{
> +   struct smalloc_node *node;
> +   unsigned long size;
> +
> +   /* Calculate the size to ask from vmalloc, page aligned. */
> +   size = page_roundup(NODE_HEADER_SIZE + words * sizeof(align_t));
> +   node = vmalloc(size);
> +   if (!node) {
> +   pr_err("No memory for allocating smalloc node.");
> +   return NULL;
> +   }
> +   /* Initialize the node.*/
> +   INIT_LIST_HEAD(&node->list);
> +   node->free = node->data;
> +   node->available_words = (size - NODE_HEADER_SIZE) / sizeof(align_t);
> +   return node;
> +}
> +
> +static __always_inline
> +void *node_alloc(struct smalloc_node *node, unsigned long words)
> +{
> +   register align_t *old_free = node->free;
> +
> +   node->available_words -= words;
> +   node->free += words;
> +   return old_free;
> +}
> +
> +void *smalloc(unsigned long size, struct smalloc_pool *pool)
> +{
> +   struct list_head *pos;
> +   struct smalloc_node *node;
> +   void *ptr;
> +   unsigned long words;
> +
> +   /* If no pool specified, use the global one. */
> +   if (!pool)
> +   pool = global_pool;

It should be impossible, but this should check for global_pool == NULL too, IMO.

> +   mutex_lock(&pool->lock);
> +
> +   /* If the pool is sealed, then return NULL. */
> +   if (pool->seal == SMALLOC_SEALED)