Re: [PATCH v5] Randomized slab caches for kmalloc()

2023-09-11 Thread Kees Cook
On Mon, Sep 11, 2023 at 11:18:15PM +0200, jvoisin wrote:
> I wrote a small blogpost[1] about this series, and was told[2] that it
> would be interesting to share it on this thread, so here it is, copied
> verbatim:

Thanks for posting!

> Ruiqi Gong and Xiu Jianfeng got their
> [Randomized slab caches for
> kmalloc()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c6152940584290668b35fa0800026f6a1ae05fe)
> patch series merged upstream, and I've and enough discussions about it to
> warrant summarising them into a small blogpost.
> 
> The main idea is to have multiple slab caches, and pick one at random
> based on
> the address of code calling `kmalloc()` and a per-boot seed, to make
> heap-spraying harder.
> It's a great idea, but comes with some shortcomings for now:
> 
> - Objects being allocated via wrappers around `kmalloc()`, like
> `sock_kmalloc`,
>   `f2fs_kmalloc`, `aligned_kmalloc`, … will end up in the same slab cache.

I'd love to see some way to "unwrap" these kinds of allocators. Right
now we try to manually mark them so the debugging options can figure out
what did the allocation, but it's not complete by any means.

I'd kind of like to see a common front end that specified some set of
"do stuff" routines. e.g. to replace devm_kmalloc(), we could have:

void *alloc(size_t usable, gfp_t flags,
size_t (*prepare)(size_t, gfp_t, void *ctx),
void * (*finish)(size_t, gfp_t, void *ctx, void *allocated)
void * ctx)

ssize_t devm_prep(size_t usable, gfp_t *flags, void *ctx)
{
ssize_t tot_size;

if (unlikely(check_add_overflow(sizeof(struct devres),
size, _size)))
return -ENOMEM;

tot_size = kmalloc_size_roundup(tot_size);
*flags |= __GFP_ZERO;

return tot_size;
}

void *devm_finish(size_t usable, gfp_t flags, void *ctx, void 
*allocated)
{
struct devres *dr = allocated;
struct device *dev = ctx;

INIT_LIST_HEAD(>node.entry);
dr->node.release = devm_kmalloc_release;

set_node_dbginfo(>node, "devm_kzalloc_release", usable);
devres_add(dev, dr->data);
return dr->data;
}

#define devm_kmalloc(dev, size, gfp)\
alloc(size, gfp, devm_prep, devm_finish, dev)

And now there's no wrapper any more, just a routine to get the actual
size, and a routine to set up the memory and return the "usable"
pointer.

> - The slabs needs to be pinned, otherwise an attacker could
> [feng-shui](https://en.wikipedia.org/wiki/Heap_feng_shui) their way
>   into having the whole slab free'ed, garbage-collected, and have a slab for
>   another type allocated at the same VA. [Jann Horn](https://thejh.net/)
> and [Matteo Rizzo](https://infosec.exchange/@nspace) have a [nice
>   set of
> patches](https://github.com/torvalds/linux/compare/master...thejh:linux:slub-virtual-upstream),
>   discussed a bit in [this Project Zero
> blogpost](https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html),
>   for a feature called [`SLAB_VIRTUAL`](
> https://github.com/torvalds/linux/commit/f3afd3a2152353be355b90f5fd4367adbf6a955e),
>   implementing precisely this.

I'm hoping this will get posted to LKML soon.

> - There are 16 slabs by default, so one chance out of 16 to end up in
> the same
>   slab cache as the target.

Future work can make this more deterministic.

> - There are no guard pages between caches, so inter-caches overflows are
>   possible.

This may be addressed by SLAB_VIRTUAL.

> - As pointed by
> [andreyknvl](https://twitter.com/andreyknvl/status/1700267669336080678)
>   and [minipli](https://infosec.exchange/@minipli/111045336853055793),
>   the fewer allocations hitting a given cache means less noise,
>   so it might even help with some heap feng-shui.

That may be true, but I suspect it'll be mitigated by the overall
reduction shared caches.

> - minipli also pointed that "randomized caches still freely
>   mix kernel allocations with user controlled ones (`xattr`, `keyctl`,
> `msg_msg`, …).
>   So even though merging is disabled for these caches, i.e. no direct
> overlap
>   with `cred_jar` etc., other object types can still be targeted (`struct
>   pipe_buffer`, BPF maps, its verifier state objects,…). It’s just a
> matter of
>   probing which allocation index the targeted object falls into.",
>   but I considered this out of scope, since it's much more involved;
>   albeit something like
> [`CONFIG_KMALLOC_SPLIT_VARSIZE`](https://github.com/thejh/linux/blob/slub-virtual/MITIGATION_README)
>   wouldn't significantly increase complexity.

Now that we have a mechanism to easily deal with "many kmalloc buckets",
I think we can easily start carving out specific variable-sized 

Re: [PATCH v5] Randomized slab caches for kmalloc()

2023-09-11 Thread jvoisin
I wrote a small blogpost[1] about this series, and was told[2] that it
would be interesting to share it on this thread, so here it is, copied
verbatim:

Ruiqi Gong and Xiu Jianfeng got their
[Randomized slab caches for
kmalloc()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c6152940584290668b35fa0800026f6a1ae05fe)
patch series merged upstream, and I've and enough discussions about it to
warrant summarising them into a small blogpost.

The main idea is to have multiple slab caches, and pick one at random
based on
the address of code calling `kmalloc()` and a per-boot seed, to make
heap-spraying harder.
It's a great idea, but comes with some shortcomings for now:

- Objects being allocated via wrappers around `kmalloc()`, like
`sock_kmalloc`,
  `f2fs_kmalloc`, `aligned_kmalloc`, … will end up in the same slab cache.
- The slabs needs to be pinned, otherwise an attacker could
[feng-shui](https://en.wikipedia.org/wiki/Heap_feng_shui) their way
  into having the whole slab free'ed, garbage-collected, and have a slab for
  another type allocated at the same VA. [Jann Horn](https://thejh.net/)
and [Matteo Rizzo](https://infosec.exchange/@nspace) have a [nice
  set of
patches](https://github.com/torvalds/linux/compare/master...thejh:linux:slub-virtual-upstream),
  discussed a bit in [this Project Zero
blogpost](https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html),
  for a feature called [`SLAB_VIRTUAL`](
https://github.com/torvalds/linux/commit/f3afd3a2152353be355b90f5fd4367adbf6a955e),
  implementing precisely this.
- There are 16 slabs by default, so one chance out of 16 to end up in
the same
  slab cache as the target.
- There are no guard pages between caches, so inter-caches overflows are
  possible.
- As pointed by
[andreyknvl](https://twitter.com/andreyknvl/status/1700267669336080678)
  and [minipli](https://infosec.exchange/@minipli/111045336853055793),
  the fewer allocations hitting a given cache means less noise,
  so it might even help with some heap feng-shui.
- minipli also pointed that "randomized caches still freely
  mix kernel allocations with user controlled ones (`xattr`, `keyctl`,
`msg_msg`, …).
  So even though merging is disabled for these caches, i.e. no direct
overlap
  with `cred_jar` etc., other object types can still be targeted (`struct
  pipe_buffer`, BPF maps, its verifier state objects,…). It’s just a
matter of
  probing which allocation index the targeted object falls into.",
  but I considered this out of scope, since it's much more involved;
  albeit something like
[`CONFIG_KMALLOC_SPLIT_VARSIZE`](https://github.com/thejh/linux/blob/slub-virtual/MITIGATION_README)
  wouldn't significantly increase complexity.

Also, while code addresses as a source of entropy has historically be a
great
way to provide [KASLR](https://lwn.net/Articles/569635/) bypasses,
`hash_64(caller ^
random_kmalloc_seed, ilog2(RANDOM_KMALLOC_CACHES_NR + 1))` shouldn't
trivially
leak offsets.

The segregation technique is a bit like a weaker version of grsecurity's
[AUTOSLAB](https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game),
or a weaker kernel-land version of
[PartitionAlloc](https://chromium.googlesource.com/chromium/src/+/master/base/allocator/partition_allocator/PartitionAlloc.md),
but to be fair, making use-after-free exploitation harder, and significantly
harder once pinning lands, with only ~150 lines of code and negligible
performance impact is amazing and should be praised. Moreover, I wouldn't be
surprised if this was backported in [Google's
KernelCTF](https://google.github.io/security-research/kernelctf/rules.html)
soon, so we should see if my analysis is correct.

1.
https://dustri.org/b/some-notes-on-randomized-slab-caches-for-kmalloc.html
2. https://infosec.exchange/@vba...@social.kernel.org/111046740392510260

-- 
Julien (jvoisin) Voisin
GPG: 04D041E8171901CC
dustri.org