On Thu, 10 Nov 2016 15:14:11 +0000 Mike Blumenkrantz <michael.blumenkra...@gmail.com> said:
> I see that this has been pushed and is already being used despite some
> objections being raised? I guess I probably missed IRC discussions.

did you read the responses to the objections? in fact just one objection. it
was "dont like" and "under valgrind frees are now asynchronous" - read the
code. it was a misunderstanding that this is a replacement for valgrind,
which it is not. it's a stability measure for average users when valgrind is
not in use, by delaying pointer re-use. it also can defer any work involved
in freeing to "when idle". and if under valgrind, the frees are done
synchronously without being queued, so it catches everything as it does right
now, leaving it to valgrind. so the issue was addressed if you read the code.

> The reasoning for needing this sounds like we should probably just use
> jemalloc (
> https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/
> ), which does more, has more people actively devoted to improving/developing
> it, and has been widely tested and benchmarked so we know it will
> definitely have the effect that we want while also reducing our
> maintenance+development overhead.

which actually is entirely orthogonal. i've proposed a more extensive way of
having an allocator with multiple allocation domains - it could be jemalloc
WITH modifications. did you read the mail? (Subject: [E-devel] memory
allocation perf)

but freeq works WITHOUT pulling in jemalloc. it enforces a purgatory for
pointers put in the freeq irrespective of malloc implementation. it's simple
and easy to work with. it actually does what people THOUGHT eina_trash should
do but doesn't. freeq enables us to debug core dumps when people dont use
valgrind - which they don't. valgrind does not even work on openbsd, so
proposing it as a solution to help debug there is pointless. and people doing
QA dont use valgrind. they dont even know how to, AND it makes things so slow
that they never would even if it was just a checkbox. yes there is
MALLOC_PERTURB_ but it has no upper limit (free 1mb of data and 1mb is filled
with pattern data), so it has a pretty hefty impact. it's also glibc only.

malloc replacement is a far deeper topic. JUST blindly using jemalloc can't
be done as it'd replace all *alloc/free funcs, so the method to do that is an
LD_PRELOAD. to use it without a preload means wrapping it in some calls and
then replacing one thing at a time with this "eina_alloc" api (memory blob by
blob). i do not want to do that with JUST a plain void * replacement. it's
not as much of a win as it could be. if we have to go replace things pointer
by pointer, let's do better. with a custom alloc impl we can:

1. have allocation domains that separate memory out like:

     domain1 = eina_alloc_domain_new();
     ptr = eina_alloc(domain1, size);
     eina_alloc_free(domain1, ptr);

whereas having a NULL domain be like malloc/free as an implicit global:

     ptr = eina_alloc(NULL, size);

this allows the back-end to be a single pool or BETTER multiple
address-separated pools, meaning we can ensure the pointers for one kind of
data don't rub shoulders with another, which means use-after-free within code
segment A is LESS likely to corrupt data from code segment B if the memory
they use comes from different domains.

2. do small pointers. 32bit ptrs on 64bit machines, halving our memory
footprint for pointers (and we use a LOT of pointers). this may actually even
speed things up as we pollute our caches less. a single 32bit domain for
almost all of efl's allocations "should be enough for everything", as if we
16 byte align we actually have access to up to 64gb of allocated memory, and
32gb if we 8 byte align, which would be necessary anyway (probably a good
call given simd implementations can do 128bit types these days, and maybe 32
byte align on systems with 256bit type support). i am not proposing we use
32bit ptrs for allocating audio data or video or images - the large blobs.
just "data structures". we can then ALSO do 16bit pointers too (each domain
could allocate up to 1m of data with 16 byte alignment, 512k with 8). so if
we use domains carefully we can drop pointer sizes down even more for certain
special cases (assuming the whole workload of memory use for that domain
would always remain small enough to fit inside the domain max pool of e.g.
512k or 1m). yes, access to these small ptrs requires resolving them like:

     realptr = domain->base + (ptr << 4);

likely a macro or static inline would do this. yes, it requires KNOWING the
domain the ptr comes from. we COULD have the NULL domain be a special "global
domain" like malloc/free do, and so the domain is a known global var
(eina_alloc_domain, eina_alloc_domain_32, eina_alloc_domain_16 - with the
plain domain and the 32 domain maybe being #defined to be the same thing on a
32bit machine).

either way this requires more than just snarfing jemalloc into our tree or
linking to a jemalloc then wrapping and using it. if we're going to do this,
then let's get it to do more than just give us an alternate malloc impl,
which can be done entirely without us lifting a finger via LD_PRELOAD
anyway :)
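to make the shape of that concrete, here's a rough sketch. purely
illustrative - none of this exists. the eina_alloc_domain_new() name is from
the proposal above, but EINA_ALLOC_PTR(), eina_alloc_small() and the dumb
bump allocator are made up just to show the domain + offset-pointer idea:

  /* sketch only: a domain is a private mmap()ed pool, a "small ptr" is an
   * offset from its base in 16 byte units. a real impl needs free lists,
   * thread safety, lazy growth etc. - this just shows the shape. */
  #define _DEFAULT_SOURCE /* for MAP_ANONYMOUS */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  typedef struct
  {
     unsigned char *base; /* start of this domain's private pool */
     size_t         size; /* pool size in bytes */
     size_t         used; /* dumb bump offset, kept 16 byte aligned */
  } Eina_Alloc_Domain;

  /* resolve a 32bit "small ptr" (offset in 16 byte units) to a real address */
  #define EINA_ALLOC_PTR(dom, sptr) \
     ((void *)((dom)->base + (((size_t)(sptr)) << 4)))

  static Eina_Alloc_Domain *
  eina_alloc_domain_new(size_t size)
  {
     static Eina_Alloc_Domain dom; /* one static domain is enough for a sketch */
     dom.base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
     if (dom.base == MAP_FAILED) return NULL;
     dom.size = size;
     dom.used = 16; /* reserve slot 0 so a small ptr of 0 can mean "none" */
     return &dom;
  }

  static uint32_t
  eina_alloc_small(Eina_Alloc_Domain *dom, size_t size)
  {
     size = (size + 15) & ~(size_t)15; /* round up so offsets fit the << 4 */
     if (dom->used + size > dom->size) return 0;
     uint32_t sptr = (uint32_t)(dom->used >> 4);
     dom->used += size;
     return sptr;
  }

  int
  main(void)
  {
     Eina_Alloc_Domain *dom = eina_alloc_domain_new(1 << 20);
     if (!dom) return 1;
     uint32_t sp = eina_alloc_small(dom, 100); /* a 32bit handle, not a void * */
     char *str = EINA_ALLOC_PTR(dom, sp);      /* resolve only when accessing */
     strcpy(str, "this pool never rubs shoulders with libc's pointers");
     printf("%s\n", str);
     return 0;
  }

the point being that handles from such a pool are half (or a quarter) the
size of a real pointer, and they can never collide with a strdup() or
anything else libc hands out.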
> On Fri, Nov 4, 2016 at 10:08 AM Carsten Haitzler <ras...@rasterman.com>
> wrote:
>
> > On Fri, 4 Nov 2016 10:18:33 -0200 Gustavo Sverzut Barbieri
> > <barbi...@gmail.com> said:
> >
> > > On Thu, Nov 3, 2016 at 9:27 PM, Carsten Haitzler <ras...@rasterman.com>
> > > wrote:
> > > > On Thu, 3 Nov 2016 11:24:14 -0200 Gustavo Sverzut Barbieri
> > > > <barbi...@gmail.com> said:
> > > >
> > > >> I guessed mempool and eina_trash did that
> > > >
> > > > nah - mempool i don't think has a "purgatory" for pointers.
> > > > they are released back into the pool.
> > >
> > > well, it could... OTOH it's just for "empty blocks", since if it's in
> > > a mempool that has memory blocks and they're still in use, it will
> > > just flag as unused.
> > >
> > > also, it simplifies bookkeeping of the memory if they are all of the
> > > same size, like you said Eina_List, it knows the size of each entry,
> > > thus just need to mark each position that is usable, not try to
> > > allocate based on size or similar -- much more efficient.
> >
> > yah. that's what mempool does... but it doesnt have 2 states for an
> > allocation. it doesnt have "in use", "freed but not able to be reused
> > yet" and "free and able to be re-used". it just has 1. in use or not.
> >
> > > > trash is actually a cache for storing ptrs but it never
> > > > actually frees anything. it doesn't know how to. you have to manually
> > > > clean trash yourself and call some kind of free func when you do the
> > > > clean. trash doesn't store free funcs at all.
> > >
> > > I don't see why it couldn't.
> >
> > but it doesn't, and eina_trash is all static inlines with structs exposed
> > so we'd break struct definition, memory layout and api to do this. if an
> > eina_trash is exposed from a lib compiled against efl 1.18 against other
> > code compiled against 1.19 - it'd break.
> > even worse, eina_trash is a single linked list, so walking through it is
> > scattered through memory, thus basically likely a cache miss each time.
> >
> > > but I find this is trying to replace malloc's internal structures,
> > > which is not so nice. As you know, malloc implementation can
> > > postpone/defer actual flushes, it's not 1:1 with brk() and munmap()
> > > since like our mempools the page or stack may have used bits that
> > > prevents that to be given back to the kernel.
> >
> > i know. but it's out of our control. we can't change what and how malloc
> > does this. we can't do smarter overwrite detection. malloc has options
> > for filling freed memory with a pattern - but it will do it to any sized
> > allocation. 1 byte or 1 gigabyte. with a custom implementation WE can
> > decide to e.g. only fill in up to 256 bytes, as this is what might be
> > used for small objects/list nodes, but leave big allocations untouched,
> > or... only fill in the FIRST N bytes of an allocation with a pattern. if
> > the pattern has been overwritten between submission to a free queue AND
> > when it is actually freed, then we have a bug in code somewhere
> > scribbling over freed memory. at least we know it and know what to be
> > looking for. malloc is far more limited in this way.
> >
> > also we can defer freeing until when WE want. e.g. after having gone idle
> > and we would otherwise sleep. malloc really doesnt have any way to do
> > this nicely. it's totally non-portable, libc specific (eg glibc) etc. and
> > even then very "uncontrollable". a free queue of our own is portable AND
> > controllable.
> >
> > > what usually adds overhead are mutexes and the algorithms trying to
> > > find an empty block... if we say freeq/trash are TLS/single-thread,
> > > then we could avoid the mutex (but see malloc(3) docs on how they try
> > > to minimize that contention), but adding a list of entries to look for
> > > a free spot is likely worse than malloc's own tuned algorithm.
> >
> > no no. i'm not talking about making a CACHE of memory blocks. simply a
> > fifo. put a ptr on the queue with a free func. it sits there for some
> > time and then something walks this from beginning to end actually
> > freeing. e.g. once we have reached an idle sleep state. THEN the frees
> > really happen. once on the free queue there is no way off. you are
> > freed. or to be freed. only a question of when.
> >
> > if there is buggy code that does something like:
> >
> >   x = malloc(10);
> >   x[2] = 10;
> >   free(x);
> >   y = malloc(10);
> >   y[2] = 10;
> >   x[2] = 5;
> >
> > ... there is a very good chance y is a recycled pointer - the same mem
> > location as x. when we do x[2] = 5 we overwrite y[2] with 5 even tho it
> > now should be 10. yes, valgrind can catch these... but you HAVE to catch
> > them while running. maybe it only happens in certain logic paths. yes,
> > coverity sometimes can find these too through static analysis. but not
> > always. and then there are the cases where this behaviour is split across
> > 2 different projects. one is efl, the other is some 3rd party app/binary
> > that does something bad. the "y" malloc is in efl. the "x" one is in an
> > app. the app now scribbles over memory owned by efl. this is bad. so efl
> > now crashes with corrupt data structures and we can never fix this at
> > all, as the app is a 3rd party project simply complaining that a crash is
> > happening in efl.
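to make that fifo concrete - a minimal sketch, made-up names, NOT the actual
freeq code, just the shape: park the ptr + its free func, poison the first
few bytes with a pattern, then really free (and check the pattern) once we go
idle:

  #include <stdio.h>
  #include <string.h>

  typedef struct
  {
     void  *ptr;                /* block waiting for its real free */
     void (*free_func)(void *); /* how to free it when its turn comes */
     size_t size;               /* allocation size, for the pattern check */
  } Free_Item;

  #define FREEQ_MAX 1024
  static const unsigned char freeq_pattern[4] = { 0x55, 0xaa, 0x5a, 0xa5 };

  static Free_Item freeq[FREEQ_MAX];
  static int freeq_count = 0;

  /* instead of freeing now, park the pointer on the fifo and poison it */
  static void
  freeq_submit(void *ptr, void (*free_func)(void *), size_t size)
  {
     if (!ptr) return;
     if (freeq_count == FREEQ_MAX)
       { /* queue full: really free the oldest entry to make room */
          freeq[0].free_func(freeq[0].ptr);
          memmove(&freeq[0], &freeq[1], (FREEQ_MAX - 1) * sizeof(Free_Item));
          freeq_count--;
       }
     size_t n = size < sizeof(freeq_pattern) ? size : sizeof(freeq_pattern);
     memcpy(ptr, freeq_pattern, n); /* only the FIRST n bytes - cheap */
     freeq[freeq_count++] = (Free_Item){ ptr, free_func, size };
  }

  /* call when the main loop goes idle: walk front to back and really free.
   * if the pattern got scribbled on while the block sat in purgatory, some
   * code wrote to freed memory - now we know to go hunting for it. */
  static void
  freeq_flush(void)
  {
     for (int i = 0; i < freeq_count; i++)
       {
          size_t n = freeq[i].size < sizeof(freeq_pattern)
                     ? freeq[i].size : sizeof(freeq_pattern);
          if (memcmp(freeq[i].ptr, freeq_pattern, n) != 0)
            fprintf(stderr, "freeq: %p was written to after being freed!\n",
                    freeq[i].ptr);
          freeq[i].free_func(freeq[i].ptr);
       }
     freeq_count = 0;
  }

in the x/y example above, the stray x[2] = 5 would land while x is still
parked on the queue (so y never gets that address), and the flush would then
complain about the trashed pattern instead of y being silently corrupted.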
> >
> > we can REDUCE these issues by ensuring the x pointer is not recycled so
> > aggressively, by having a free queue. have a few hundred or a few
> > thousand pointers sit on that queue for a while and HOPE this means the
> > buggy code will write to this memory while it's still allocated but not
> > in use... thus REDUCING the bugs/crashes at the expense of latency on
> > freeing memory. it doesn't fix the bug but it mitigates the worst side
> > effects.
> >
> > of course i'd actually like to replace all our allocations with our own
> > special allocator that keeps pointers and allocations used in efl
> > separated out into different domains. e.g. eo can have a special "eo
> > object data" domain and all eo object data is allocated from here.
> > pointers from here can never be recycled for a strdup() or a general
> > malloc() or an eina_list_append (that already uses a mempool anyway),
> > etc. - the idea being that it's HARDER to accidentally stomp over a
> > completely unrelated data structure because pointers are not re-cycled
> > from the same pool. e.g. efl will have its own pool of memory and at
> > least if pointers are re-used, they are re-used only within that
> > domain/context. if we are even smarter we can start using 32bit pointers
> > on 64bit by returning unsigned ints that are an OFFSET into a single 4gb
> > mmaped region. even better, bitshifting could give us 16 or 32 or even
> > 64gb of available address space for these allocations if we force
> > alignment to 4, 8 or 16 bytes (probably a good idea). so you access such
> > ptrs with:
> >
> >   #define P(dom, ptr) \
> >     ((void *)(((unsigned char *)((dom)->base)) + (((size_t)(ptr)) << 4)))
> >
> > so as long as you KNOW the domain it comes from you can compress pointers
> > down to 1/2 the size ... even 1/4 the size and use 16bit ptrs... like
> > above. (that would give you 1mb of memory space per domain, so for
> > smallish data sets it might be useful). this relies on you knowing in
> > advance the domain source and getting this right. we can still do full
> > ptrs too. but this would quarantine memory and pointers from each other
> > (libc vs efl) and help isolate bugs/problems.
> >
> > but this is a hell of a lot more work. it needs a whole malloc
> > implementation. i'm not talking about that. far simpler. a queue of
> > pointers to free at a future point. not a cache. not trash. not to then
> > be dug out and re-used. that is the job of the free func and its
> > implementation to worry about. if it's free() or some other free
> > function. only put memory into the free queue you can get some sensible
> > benefits out of. it's voluntary. just replace the existing free func with
> > the one that queues the free with the free func and ptr and size, for
> > example. do it 1 place at a time - totally voluntary. doesn't hurt to do
> > it.
> >
> > --
> > ------------- Codito, ergo sum - "I code, therefore I am" --------------
> > The Rasterman (Carsten Haitzler)                    ras...@rasterman.com
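fwiw, "do it 1 place at a time" really is just swapping the call at the free
site - a tiny sketch using the made-up freeq_submit()/freeq_flush() from the
sketch earlier in this mail (not any real api):

  #include <stdlib.h>
  #include <string.h>

  /* hypothetical call site - the only change is the line doing the free */
  void
  some_existing_code(void)
  {
     char *x = malloc(10);
     if (!x) return;
     memcpy(x, "hello", 6);
     /* was: free(x); - now x parks in the purgatory fifo instead and only
      * really goes away the next time the loop goes idle and flushes */
     freeq_submit(x, free, 10);
  }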
--
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)                    ras...@rasterman.com