Re: hardened malloc is big and slow

2022-09-15 Thread Daniel Micay via devel
On Wed, Sep 07, 2022 at 08:39:56AM -0700, John Reiser wrote:
> On 9/5/22 19:45, Daniel Micay wrote:
> > On Wed, Aug 31, 2022 at 10:19:51AM -0700, John Reiser wrote:
> > > > Bottom line opinion: hardened_malloc ... costs too much.
> > > 
> > > Attempting to be constructive: Psychologically, I might be willing to pay
> > > a "security tax" of something like 17%, partly on the basis of similarity
> > > to the VAT rate (Value Added Tax) in some parts of the developed world.
> > 
> > The comparison is being done incorrectly. Since hardened_malloc builds
> > both a lightweight and heavyweight library by default,
> 
> That claim is false.

You're not following the official approach to packaging and installing
hardened_malloc. It has 2 official build configurations and packaging
that's done properly includes both. We don't currently define other
configurations, but we could define a 'lightest' one too.

I've given both concise and detailed explanations here, which you've
gone out of the way to ignore.

> The Makefile for commit 72fb3576f568481a03076c62df37984f96bfdfeb
> on Tue Aug 16 07:47:26 2022 -0400  (which is the HEAD of the trunk) begins
> =
> VARIANT := default
> 
> ifneq ($(VARIANT),)
> CONFIG_FILE := config/$(VARIANT).mk
> include config/$(VARIANT).mk
> endif
> 
> ifeq ($(VARIANT),default)
> SUFFIX :=
> else
> SUFFIX := -$(VARIANT)
> endif
> 
> OUT := out$(SUFFIX)
> =
> and builds only one library, namely $OUT/libhardened_malloc$SUFFIX.so
> which for the case of "no options specified" is out/libhardened_malloc.so .
> 
> It would be better for external perception if the name "libhardened_malloc.so"
> were changed to something like "libhardened_malloc-strong.so".
> Having both -strong and -light versions built every time
> would highlight the difference, and force the user to decide,
> and encourage the analysis that is required to make an informed choice.

The 2 default configurations are not the only choices. The light
configuration still has full zero-on-free and canaries enabled.

If we felt like matching or even exceeding glibc malloc performance on
microbenchmarks we could add an optional thread cache and a performance
configuration but it's not the point of the project at all, and glibc
malloc is not a high performance allocator. hardened_malloc can provide
similar performance with all optional features disabled vs. glibc malloc
with tcache disabled. If hardened_malloc had array-based thread caching
added (free lists would lose even the very basic 100% out-of-line
metadata security property) then with optional features disabled it
would be comparable to the default glibc malloc configuration. We've
already done extensive testing. There's no thread cache included because
it simply isn't within the scope of the project. It's a hardened
allocator, and a thread cache bypasses hardening and keeps invalid free
detection, randomization, quarantines, and other features from working
properly. It has been tested with a thread cache, so we know the impact.
I don't think it makes sense to use a hardened allocator with one.

> > already explained this and that the lightweight library still has
> > optional security features enabled, it doesn't seem to have been done in
> > good faith. My previous posts where I provided both concise and detailed
> > information explaining differences and the approach were ignored. Why is
> > that?
> > 
> > As I said previously, hardened_malloc has a baseline very hardened
> > allocator design. It also has entirely optional, expensive security
> > features layered on top of that. I explained in detail that some of
> > those features have a memory cost. Slab allocation canaries have a small
> > memory cost and slab allocation quarantines have a very large memory
> > cost especially with the default configuration. Those expensive optional
> > features each have an added performance cost too.
> > 
> > Measuring with 100% of the expensive optional features enabled and
> > trying to portray the performance of the allocator solely based on that
> > is simply incredibly misleading and disregards all of my previous posts
> > in the thread.
> 
> I measured the result of building and using with the default options.
> Unpack the source, use "as-is" with no adjustment, no tweaking, no tuning.
> If the default source is not appropriate to use as widely as implied
> by the name "malloc" (with no prefix and no suffix on the subroutine name),
> then the package is not suitable for general use.
> Say so immediately at the beginning of the README.md: "This software
> is not suitable for widespread general use, unless adjusted according to
> the actual use cases."

The hardened_malloc project is perfectly suitable for general purpose
use and heavily preferring security over both performance and memory
usage for one of the 2 default configurations doesn't make it any less
general purpose. The chosen compromises do not impact whether or not it
is a general purpose allocator. Both default configurations are general
purpose.

Re: hardened malloc is big and slow

2022-09-05 Thread Daniel Micay via devel
On Wed, Aug 31, 2022 at 05:59:42PM +0200, Pablo Mendez Hernandez wrote:
> Adding Daniel for awareness.

Why was the heavyweight rather than lightweight configuration used? Why
compare with all the expensive optional security features enabled? Even
the lightweight configuration has 2 of the optional security features
enabled: slab canaries and full zero-on-free. Both of those should be
disabled to measure the baseline performance. Using the heavyweight
configuration means having large slab allocation quarantines and not
just zero-on-free but checking that data is still zeroed on allocation
(which more than doubles the cost), slot randomization and multiple
other features. It just doesn't make sense to turn security up to 11
with optional features and then present that as if it's the performance
offered.
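The write-after-free detection mentioned above (zero-on-free plus checking that data is still zeroed on allocation) can be sketched as a toy model. This is a hypothetical Python sketch for illustration only, not the actual C implementation; the real allocator operates on raw slab memory:

```python
# Toy model of zero-on-free plus check-zeroed-on-allocation.
# Hypothetical sketch; the real allocator does this on raw slab memory in C.

class Slab:
    def __init__(self, slot_size, slots):
        self.slot_size = slot_size
        self.memory = bytearray(slot_size * slots)
        self.free_slots = list(range(slots))

    def alloc(self):
        slot = self.free_slots.pop()
        start = slot * self.slot_size
        # Check-on-allocation: the slot must still be zeroed; otherwise a
        # write happened while the slot was free.
        if any(self.memory[start:start + self.slot_size]):
            raise RuntimeError("write-after-free detected")
        return slot

    def free(self, slot):
        start = slot * self.slot_size
        # Zero-on-free: wipe the slot's data immediately.
        self.memory[start:start + self.slot_size] = bytes(self.slot_size)
        self.free_slots.append(slot)

slab = Slab(slot_size=16, slots=4)
slot = slab.alloc()
slab.free(slot)
# Simulate a dangling-pointer write into the freed slot:
slab.memory[slot * 16] = 0xFF
try:
    slab.alloc()        # reuses the same slot (LIFO free list here)
    detected = False
except RuntimeError:
    detected = True
print(detected)         # expect: True
```

The check roughly doubles the cost relative to plain zero-on-free because freed memory is touched twice: once to zero it and once more to verify it is still zeroed before reuse.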

I'm here to provide clarifications about my project and to counter
incorrect beliefs about it. I don't think it makes much sense for Fedora
to use it as a default allocator but the claims being made about memory
usage and performance are very wrong. I already responded and provided
both concise and detailed explanations. I don't know what these nonsense
measurements completely disregarding all that are meant to demonstrate.

It's a huge hassle for me to respond here because I have no interest in
this list and don't want to be subscribed to it. I didn't propose that
Fedora uses it and don't think it makes sense for Fedora. At the same
time I already explained that glibc malloc is ALSO a very bad choice in
detail. Linux distributions not willing to sacrifice much for security
would be better served by using jemalloc with small chunk sizes on 64
bit operating systems. ASLR is too low entropy on 32 bit to afford the
sacrifice of a few bits for chunk alignment though. It can be configured
with extra sanity checks enabled and with certain very non-essential
features disabled to provide a better balance of security vs.
performance. The defaults are optimized for long running server
processes. It's very configurable, including by individual applications.

hardened_malloc builds both a lightweight and heavyweight library
itself. The lightweight library still has the optional slab allocation
canary and full zero-on-free features enabled. Both those should be
disabled to truly measure the baseline cost. None of those optional
features is provided by glibc malloc. None of them is needed to get the
benefits of hardened_malloc's 100% out-of-line metadata, 100% invalid
free detection, entirely separate never reused address space regions for
all allocator metadata and each slab allocation size class (which covers
up to 128k by default), virtual memory quarantines + random guards for
large allocations, etc. etc.

The optional security features are optional because they're expensive.
That's the point of building both a sample lightweight and heavyweight
configuration by default. The lightweight configuration is essentially the
recommended configuration if you aren't willing to make more significant
sacrifices for security. It's not the highest performance configuration
it offers, just a reasonable compromise.

Slab allocation canaries slightly increase memory usage. Slab allocation
quarantines (disabled in lightweight configuration, which is built by
default) greatly increase memory usage, especially with the default
configuration. The whole point of quarantines is that they delay reuse
of the memory and since these are slab allocations within slabs the
memory gets held onto.
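The reuse-delaying effect can be sketched with a simple FIFO quarantine. This is a hypothetical Python sketch; the real feature also randomizes quarantine slots and uses far larger quarantines, which is why the memory cost is so significant:

```python
from collections import deque

# Toy slab quarantine: freed slots enter a FIFO queue and only become
# reusable once pushed out by later frees. Hypothetical sketch.
QUARANTINE_SIZE = 3

free_slots = []          # slots actually available for reuse
quarantine = deque()

def quarantined_free(slot):
    quarantine.append(slot)
    if len(quarantine) > QUARANTINE_SIZE:
        # The oldest quarantined slot finally becomes reusable.
        free_slots.append(quarantine.popleft())

quarantined_free(0)
immediately_reusable = 0 in free_slots   # False: slot 0 is held in quarantine

for slot in (1, 2, 3):                   # later frees push slot 0 out
    quarantined_free(slot)

eventually_reusable = 0 in free_slots    # True, but only after a delay
print(immediately_reusable, eventually_reusable)   # expect: False True
```

The memory held in quarantine is exactly the point: the longer reuse is delayed, the more use-after-free bugs are caught, and the more memory stays pinned.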

If you wanted to measure the baseline performance, then you'd do as I
suggested: measure with all the optional features disabled (at minimum
the 2 optional features enabled in the light configuration) and compare
that to both glibc malloc and glibc malloc with tcache disabled.

I explained previously that hardened_malloc could provide an array-based
thread cache as an opt-in feature, but currently it isn't done because
it inherently reduces security. There would no longer be 100% reliable
detection of all invalid frees, and a lot of other security properties
would be lost. It also hardly makes sense to have optional features like
quarantines and slot randomization underneath unless the thread caches
do the same thing.

As I said previously, if you compare hardened_malloc with optional
features disabled to glibc malloc with tcache disabled, it performs as
well and has much lower fragmentation and lower metadata overhead. If
you stick a small array-based thread cache onto hardened_malloc, then it
can perform as well as glibc with much larger freelist-based thread
caches since it has a different approach to scaling with jemalloc-style
arenas.
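The conflict between a thread cache and reliable invalid free detection can be illustrated with a toy model. This is a hypothetical Python sketch, not hardened_malloc code: the central allocator validates every free against out-of-line slot metadata, but once freed pointers sit in a per-thread array they can be handed back out without the central allocator ever seeing the free:

```python
# Toy central allocator with 100% invalid-free detection via slot metadata,
# plus an array-based thread cache that bypasses that check.
# Hypothetical sketch for illustration only.

class CentralAllocator:
    def __init__(self, slots):
        self.allocated = [False] * slots
        self.free_list = list(range(slots))

    def alloc(self):
        slot = self.free_list.pop()
        self.allocated[slot] = True
        return slot

    def free(self, slot):
        # Out-of-line metadata lets every free be validated.
        if not self.allocated[slot]:
            raise RuntimeError("invalid free detected")
        self.allocated[slot] = False
        self.free_list.append(slot)

class ThreadCache:
    def __init__(self, central, size=4):
        self.central = central
        self.cache = []          # small array of recently freed slots
        self.size = size

    def alloc(self):
        return self.cache.pop() if self.cache else self.central.alloc()

    def free(self, slot):
        # Freed slots enter the cache without consulting central metadata,
        # so a double free of a cached slot is not detected here.
        if len(self.cache) < self.size:
            self.cache.append(slot)
        else:
            self.central.free(slot)

central = CentralAllocator(slots=8)
slot = central.alloc()
central.free(slot)
try:
    central.free(slot)           # double free hits the metadata check
    central_detected = False
except RuntimeError:
    central_detected = True

cache = ThreadCache(CentralAllocator(slots=8))
slot = cache.alloc()
cache.free(slot)
cache.free(slot)                 # same slot cached twice: undetected
cache_detected = False           # no exception was raised
print(central_detected, cache_detected)   # expect: True False
```

This is why the thread cache is left out rather than offered as a default: every allocation and free that the cache absorbs is one the hardened central allocator never gets to check, randomize, or quarantine.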
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 

Re: hardened malloc is big and slow

2022-09-05 Thread Daniel Micay via devel
On Wed, Aug 31, 2022 at 10:19:51AM -0700, John Reiser wrote:
> > Bottom line opinion: hardened_malloc ... costs too much.
> 
> Attempting to be constructive: Psychologically, I might be willing to pay
> a "security tax" of something like 17%, partly on the basis of similarity
> to the VAT rate (Value Added Tax) in some parts of the developed world.

The comparison is being done incorrectly. Since hardened_malloc builds
both a lightweight and heavyweight library by default, and since I
already explained this and that the lightweight library still has
optional security features enabled, it doesn't seem to have been done in
good faith. My previous posts where I provided both concise and detailed
information explaining differences and the approach were ignored. Why is
that?

As I said previously, hardened_malloc has a baseline very hardened
allocator design. It also has entirely optional, expensive security
features layered on top of that. I explained in detail that some of
those features have a memory cost. Slab allocation canaries have a small
memory cost and slab allocation quarantines have a very large memory
cost especially with the default configuration. Those expensive optional
features each have an added performance cost too.

Measuring with 100% of the expensive optional features enabled and
trying to portray the performance of the allocator solely based on that
is simply incredibly misleading and disregards all of my previous posts
in the thread.

hardened_malloc builds both a lightweight and heavyweight library by
default. The lightweight library still has 2 of the optional security
features enabled. None of the optional security features is provided by
glibc malloc and if you want to compare the baseline performance then
none of those should be enabled for a baseline comparison.

Take the light configuration, disable slab allocation canaries and full
zero-on-free, and there you go.

I also previously explained that hardened_malloc does not include a
thread cache for security reasons inherent to the concept of a thread
cache. An array-based thread cache with out-of-line metadata would still
hurt security, but would be a more suitable approach than a free list
compromising the otherwise complete lack of inline metadata.

Compare hardened_malloc with the optional security features disabled to
glibc malloc and also to glibc malloc with tcache disabled. It's easy
enough to stick a thread cache onto hardened_malloc and if there was
demand for that I could implement it in half an hour. At the moment, the
current users of hardened_malloc don't want to make the sacrifice of
losing 100% reliable detection of invalid frees along with the many
other benefits lost by doing that.


Re: hardened memory allocator port to linux-fedora system for security

2022-08-29 Thread Daniel Micay via devel
On Mon, Aug 15, 2022 at 07:39:46PM -0700, John Reiser wrote:
> On 8/13/22, Demi Marie Obenour wrote:
> > On 8/13/22, Kevin Kofler via devel wrote:
> > > martin luther wrote:
> > > > should we implement https://github.com/GrapheneOS/hardened_malloc/
> > > > it is hardened memory allocate it will increase the security of fedora
> > > > according to the graphene os team it can be ported to linux as well need
> > > > to look at it
> > 
> > CCing Daniel Micay who wrote hardened_malloc.
> > 
> > > There are several questions that come up:  [[snip]]
> 
> It seems to me that hardened_malloc could increase working set and RAM
> desired by something like 10% compared to glibc for some important workloads,
> such as Fedora re-builds.  From page 22 of [1] (attached here; 203KB), the 
> graph
> of number of requests versus requested size shows that blocks of size <= 128
> were requested tens to thousands of times more often than all the rest.

In the lightweight configuration, hardened_malloc uses substantially less
memory for small allocations than glibc malloc.

None of the GrapheneOS or hardened_malloc developers or project members
has proposed that Fedora switch to hardened_malloc, but it would reduce
rather than increase memory usage if you used it without the slab
quarantine features. Slab canaries use extra memory too, but the
overhead is lower than glibc metadata overhead. The sample lightweight
configuration still uses slab canaries.

If you bolted on a jemalloc-style array-based thread cache or a
problematic TCMalloc-style one as was copied for glibc, then you would
be able to get comparable performance and better scalability than glibc
malloc, but that is outside the scope of what hardened_malloc is
intended to provide. We aren't trying to serve that niche in
hardened_malloc. It does not mean that glibc malloc is well suited to
being the chosen allocator. That really can't be justified for any
technical reasons. If you replaced glibc malloc with jemalloc, the only
people who would be unhappy are people who care about the loss of ASLR
bits from chunk alignment, which if you make the chunks small enough and
configure ASLR properly really doesn't matter on 64-bit. I can't think
of a case where glibc malloc would be better than jemalloc with small
chunk sizes when using either 4k pages with a 48-bit address space or
larger pages. glibc malloc's overall design is simply not competitive
anymore, and it wastes tons of memory from both metadata overhead and
also fragmentation. I can't really understand what justification there
would be for not replacing it outright with a more modern design and
adding the necessary additional APIs required for that as we did
ourselves for our own security-focused allocator.

> For sizes from 0 through 128, the "Size classes" section of README.md of [2]
> documents worst-case internal fragmentation (in "slabs") of 93.75% to 11.72%.
> That seems too high.  Where are actual measurements for workloads such as
> Fedora re-builds?

The minimum alignment is 16 bytes. glibc malloc has far more metadata
overhead, internal and external fragmentation than hardened_malloc in
reality. It has headers on allocations, rounds to much less fine grained
bucket sizes and fragments all the memory with the traditional dlmalloc
style approach. There was a time when that approach was a massive
improvement over past ones but that time was the 90s, not 2022.

> (Also note that the important special case of malloc(0), which is analogous
> to (gensym) of Lisp and is implemented internally as malloc(1), consumes
> 16 bytes and has a fragmentation of 93.75% for both glibc and hardened_malloc.
> The worst fragmentation happens for *every* call to malloc(0), which occurred
> about 800,000 times in the sample.  Yikes!)

glibc malloc has headers giving it more than 100% pure overhead for a 16
byte allocation. It cannot do finer grained rounding than we do for 16
through 128 bytes, and sticking headers on allocations makes it far
worse. It also gets even worse with aligned allocations, such as common
64 byte aligned allocations, whereas with slab allocation any allocation
size up to the page size already has its natural alignment: 64 byte
alignment for 64 byte allocations, 128 byte for 128 byte, 256 byte for
256 byte, etc.
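The alignment arithmetic can be checked directly. This is a hypothetical Python sketch with an assumed 4096 byte page and an assumed 16 byte header for the header-based contrast; it is not either allocator's actual layout code:

```python
PAGE_SIZE = 4096   # assumed page size for the sketch
HEADER = 16        # assumed per-allocation header in a header-based design

def slab_offsets(size_class):
    # Page-aligned slab with slots packed back to back: offset = i * class.
    return [i * size_class for i in range(PAGE_SIZE // size_class)]

def header_offsets(size_class):
    # Header-based layout: user data sits after each per-allocation header.
    chunk = HEADER + size_class
    return [i * chunk + HEADER for i in range(PAGE_SIZE // chunk)]

# 64 byte allocations from a 64 byte slab class are all 64 byte aligned:
slab_aligned = all(off % 64 == 0 for off in slab_offsets(64))
# With headers, the data offsets lose that natural alignment:
header_aligned = all(off % 64 == 0 for off in header_offsets(64))
print(slab_aligned, header_aligned)   # expect: True False
```

A header-based allocator has to pad aligned requests back up to the requested alignment, which is where the extra waste for common 64 byte aligned allocations comes from.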

malloc(0) doesn't really make sense to compare because in hardened_malloc
it's a pointer to non-allocated pages with PROT_NONE memory protection.


Re: hardened memory allocator port to linux-fedora system for security

2022-08-26 Thread Daniel Micay via devel
On Mon, Aug 15, 2022 at 07:39:46PM -0700, John Reiser wrote:
> On 8/13/22, Demi Marie Obenour wrote:
> > On 8/13/22, Kevin Kofler via devel wrote:
> > > martin luther wrote:
> > > > should we implement https://github.com/GrapheneOS/hardened_malloc/
> > > > it is hardened memory allocate it will increase the security of fedora
> > > > according to the graphene os team it can be ported to linux as well need
> > > > to look at it
> > 
> > CCing Daniel Micay who wrote hardened_malloc.
> > 
> > > There are several questions that come up:  [[snip]]
> 
> It seems to me that hardened_malloc could increase working set and RAM
> desired by something like 10% compared to glibc for some important workloads,
> such as Fedora re-builds.  From page 22 of [1] (attached here; 203KB), the 
> graph
> of number of requests versus requested size shows that blocks of size <= 128
> were requested tens to thousands of times more often than all the rest.

It has far less fragmentation than glibc malloc. It also has far lower
metadata overhead since there are no headers on allocations and only a
few bits consumed per small allocation. glibc has over 100% metadata
overhead for 16 byte allocations while for hardened_malloc it's a very
low percentage. Of course, you need to compare with slab allocation
quarantines and slab allocation canaries disabled in hardened_malloc.

> For sizes from 0 through 128, the "Size classes" section of README.md of [2]
> documents worst-case internal fragmentation (in "slabs") of 93.75% to 11.72%.
> That seems too high.  Where are actual measurements for workloads such as
> Fedora re-builds?

Internal fragmentation means the waste caused by size class rounding. There
is no way to have size classes that aren't multiples of 16 due to it
being required by the x86_64 and arm64 ABI. glibc has over 100% overhead
for 16 byte allocations due to header metadata and other metadata. It
definitely isn't lighter for those compared to a modern slab allocator.

There's a 16 byte alignment requirement for malloc on x86_64 and arm64
so there's no way to have any size classes between the initial multiples
of 16.
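The worst-case figures quoted from the README fall straight out of this 16 byte rounding, as a quick arithmetic check shows (simplified sketch: it only models the multiples-of-16 classes discussed here, not the full size class table):

```python
def size_class(n):
    # Round a request up to the next multiple of 16 (minimum 16),
    # mirroring the 16 byte ABI alignment requirement described above.
    return max(16, -(-n // 16) * 16)

def worst_case_fragmentation(cls):
    # The smallest request landing in this class wastes the most:
    # one byte more than the previous class (cls - 15 bytes used).
    smallest = cls - 15
    return (cls - smallest) / cls * 100

frag_16 = worst_case_fragmentation(size_class(1))     # 1 byte in a 16 byte slot
frag_128 = worst_case_fragmentation(size_class(113))  # 113 bytes in a 128 byte slot
print(round(frag_16, 2), round(frag_128, 2))          # expect: 93.75 11.72
```

So the "93.75% to 11.72%" range in the README is the unavoidable worst case for any ABI-conforming allocator with 16 byte granularity, not something specific to hardened_malloc.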

Slab allocation canaries are an optional hardened_malloc feature adding
8 byte random canaries to the end of allocations, which in many cases
will increase the size class if there isn't room within the padding.
Slab allocation quarantines are another optional feature which require
dedicating substantial memory to avoiding reuse of allocations.
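The canary's interaction with size classes can be sketched numerically. This is a hypothetical simplification: it assumes multiples-of-16 classes only, whereas the actual class boundaries are documented in the README:

```python
def size_class(n):
    # Simplified size classes: multiples of 16, minimum 16. The real
    # allocator also has finer-grained classes at larger sizes.
    return max(16, -(-n // 16) * 16)

CANARY = 8  # random canary appended to the end of each slab allocation

no_canary = size_class(16)                # 16: request exactly fills its class
with_canary = size_class(16 + CANARY)     # 32: canary forced the next class
fits_in_padding = size_class(5 + CANARY)  # 16: class padding absorbed the canary
print(no_canary, with_canary, fits_in_padding)   # expect: 16 32 16
```

Whether the canary costs anything thus depends on how much rounding padding the request already had, which is why the memory cost is small on average rather than a flat 8 bytes per allocation.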

You should compare without the optional features enabled as a baseline
because glibc doesn't have any of those security features, and the
baseline hardened_malloc design is far more secure.

> (Also note that the important special case of malloc(0), which is analogous
> to (gensym) of Lisp and is implemented internally as malloc(1), consumes
> 16 bytes and has a fragmentation of 93.75% for both glibc and hardened_malloc.
> The worst fragmentation happens for *every* call to malloc(0), which occurred
> about 800,000 times in the sample.  Yikes!)

malloc(0) is not implemented as malloc(1) in hardened_malloc and does
not use any memory for the data, only the metadata, which is a small
percentage of the allocation size even for 16 byte allocations since
there is only slab metadata for the entire slab and bitmaps to track
which slots are used. There are no allocation headers.

Doing hundreds of thousands of malloc(0) allocations only uses a few
bytes of memory in hardened_malloc. Each allocation requires a bit in
the bitmap and each slab of 256x 16 byte allocations (4096 byte slab)
has slab metadata. All the metadata is in a dedicated metadata region.
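The metadata cost described above is easy to bound with a short arithmetic sketch (using the 4096 byte slab of 16 byte slots from the text; the per-slab bookkeeping structure itself is ignored here):

```python
SLAB_SIZE = 4096
SLOT_SIZE = 16

slots = SLAB_SIZE // SLOT_SIZE    # 256 slots per slab
bitmap_bytes = slots // 8         # one bit per slot -> 32 bytes of bitmap
overhead_pct = bitmap_bytes / SLAB_SIZE * 100

print(slots, bitmap_bytes, round(overhead_pct, 2))   # expect: 256 32 0.78
```

Under 1% of overhead for the smallest size class, held in a separate metadata region, versus a per-allocation header design where a 16 byte header on a 16 byte allocation is already 100% overhead.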

I strongly recommend reading all the documentation thoroughly:

https://github.com/GrapheneOS/hardened_malloc/blob/main/README.md

hardened_malloc is oriented towards security and provides a bunch of
important security properties unavailable with glibc malloc. It also has
lower fragmentation and with the optional security features disabled
also lower memory usage for large processes and especially over time. If
you enable the slab quarantines, that's going to use a lot of memory. If
you enable slab canaries, you give up some of the memory usage reduction
from not having per-allocation metadata headers. Neither of those
features exists in glibc malloc, jemalloc, etc. so it's not really fair
to enable the optional security features for hardened_malloc and compare
with allocators without them.

Slab allocation quarantines in particular inherently require a ton of
memory in order to delay reuse of allocations for as long of a time as
is feasible. This pairs well with zero-on-free + write-after-free-check
based on zero-on-free, since if any non-zero write occurs while
quarantined/freed it will be detected before the allocation is reused.
As long as zero-on-free is enabled, which it is even for the sample
light configuration, then all memory is known to be zeroed at allocation
time, which is how the write-after-free-check works. All of