Hi,

This is v2 of the "KASLR feature to randomize each loadable module" patchset.
The purpose is to increase the randomization and makes the modules randomized
in relation to each other instead of just the base, so that if one module leaks,
the location of the others can't be inferred.

This code needs some refactoring and simplification, but I was hoping to get
some feedback on the benchmarks and provide an update.

V2 brings the TLB flushes down to close to the existing algorithm and increases
the modules that get high randomness based on the concerns raised by Jann Horn
about the BPF JIT use case. It also tries to address Kees Cook's comments about
possible minimal boot time regression by measuring the average allocation time
to be below the existing allocator. It also addresses Mathew Wilcox's comment
on the GFP_NOWARN not being needed. There is also some data on PTE memory use
which is higher than the original algorithm, as suggested by Jann.

This is off of 4.18-RC3.

Changes since v1:
 - New implementation of __vmalloc_node_try_addr based on the
   __vmalloc_node_range implementation, that only flushes TLB when needed.
 - Modified module loading algorithm to try to reduce the TLB flushes further.
 - Increase "random area" tries in order to increase the number of modules that
   can get high randomness.
 - Increase "random area" size to 2/3 of module area in order to increase the
   number of modules that can get high randomness.
 - Fix for 0day failures on other architectures.
 - Fix for wrong debugfs permissions. (thanks to Jann)
 - Spelling fix .(thanks to Jann)
 - Data on module_alloc performance and TLB flushes. (brought up by Kees and
   Jann)
 - Data on memory usage. (suggested by Jann)
 
Todo:
 - Refactor __vmalloc_node_try_addr to be smaller and share more code with the
   normal vmalloc path, and misc cleanup
 - More real world testing other than the synthetic micro benchmarks described
   below. BPF kselftest brought up by Daniel Borkman.

New Algorithm
=============
In addition to __vmalloc_node_try_addr only purging the lazy free areas when it
needs to, it also now supports a mode where it will fail to allocate instead of
doing any purge. In this case it reports when the allocation would have
succeeded if it was allowed to purge. It returns this information via an
ERR_PTR.

The logic for the selection of a location in the random ara is changed as well.
The number of tries is increased from 10 to 10000, which actually still gives
good performance. At a high level, the vmalloc algorithm quickly traverses an
RB-Tree to find a start position and then more slowly traverses a link list to
look for an open spot. Since this new algorithm randomly picks a spot, it
mostly just needs to traverse the RB-Tree, and as a result the "tries" are fast
enough that the number can be high and still be faster than traversing the
linked list. In the data below you can see that the random algorithm is on
average actually faster than the existing one.

The increase in number of tries is also to support the BPF JIT use case, by
increasing the number of modules that can get high randomness.

Since the __vmalloc_node_try_addr now can optionally fail instead of purging,
for the first half of the tries, the algorithm tries to find a spot where it
doesn't need to do a purge. For the second half it allows purges. The 50:50
split is to try to be a happy medium between reducing TLB flushes and reducing
average allocation time.

Randomness
==========
In the last patchset the size of the random area used in the calculations was
incorrect. The entropy should have been closer to 17 bits, not 18, which is why
its lower here even though the number of random area tries is cranked up. 17.3
bits is likely maintained to much higher number of allocations than shown here
in reality, since it seems that the BPF JIT allocations are usually smaller than
modules. If you assume the majority of allocations are 1 page, 17 bits is
maintained to 8000 modules.

Modules         Min Info
1000            17.3
2000            17.3
3000            17.3
4000            17.3
5000            17.08
6000            16.30
7000            15.67
8000            14.92

Allocation Time
===============
The average module_alloc time over N modules was actually always faster with the
random algorithm:

Modules Existing(ns)    New(ns)
1000    4,761           1,134
2000    9,730           1,149
3000    15,572          1,396
4000    20,723          2,161
5000    26,206          4,349
6000    31,374          8,615
7000    36,123          14,009
8000    40,174          23,396

Average Nth Allocation time was usually better than the existing algorithm,
until the modules get very high.

Module  Original(ns)    New(ns)
1000    8,800           1,288
2000    20,949          1,477
3000    31,980          2,583
4000    44,539          9,250
5000    55,212          25,986
6000    65,968          39,540
7000    74,883          57,798
8000    85,392          97,319

TLB Flushes Per Allocation
==========================
The new algorithm flushes the TLB a little bit more than the existing algorithm.
For the sake of comparison to the old simpler __vmalloc_node_try_addr
implementation, this is about a 238x improvement in some cases.

Modules Existing        New
1000    0.0014          0.001407
2000    0.001746        0.0018155
3000    0.0018186667    0.0021186667
4000    0.00187525      0.00249675
5000    0.001897        0.0025334
6000    0.0019066667    0.0025228333
7000    0.001925        0.0025315714
8000    0.0019325       0.002553

Memory Usage
============
A downside is that since the random area is fragmented, it uses extra PTE pages.
It first approaches 1.3 MB of PTEs as the random area fills. After that it
increases more slowly. I am not sure this can be improved without reducing
randomness.

Modules Existing(pages) New(pages)
100     6               159
200     11              240
300     15              285
400     20              307
500     23              315
1000    41              330
2000    80              335
3000    118             338

Module Capacity
===============
The estimate of module capacity also now goes back up to ~17000, so the real
value should be close to the existing algorithm.


Rick Edgecombe (3):
  vmalloc: Add __vmalloc_node_try_addr function
  x86/modules: Increase randomization for modules
  vmalloc: Add debugfs modfraginfo

 arch/x86/include/asm/pgtable_64_types.h |   1 +
 arch/x86/kernel/module.c                | 103 +++++++++++-
 include/linux/vmalloc.h                 |   3 +
 mm/vmalloc.c                            | 275 +++++++++++++++++++++++++++++++-
 4 files changed, 375 insertions(+), 7 deletions(-)

-- 
2.7.4

Reply via email to