On Fri, 26 Aug 2016 22:14:11 -0700 Cedric BAIL <cedric.b...@free.fr> said:

> On Fri, Aug 26, 2016 at 5:44 PM, Carsten Haitzler <ras...@rasterman.com>
> wrote:
> > On Fri, 26 Aug 2016 12:15:57 -0700 Cedric BAIL <cedric.b...@free.fr> said:
> >> On Fri, Aug 26, 2016 at 2:46 AM, Tom Hacohen <t...@osg.samsung.com> wrote:
> >> > On 24/08/16 20:03, Cedric BAIL wrote:
> >> >> On Wed, Aug 24, 2016 at 2:24 AM, Tom Hacohen <t...@osg.samsung.com>
> >> >> wrote:
> >> >>> On 23/08/16 18:51, Cedric BAIL wrote:
> >> >>>> On Tue, Aug 23, 2016 at 3:31 AM, Tom Hacohen <t...@osg.samsung.com>
> >> >>>> wrote:
> 
> <snip>
> 
> >> >>>>> We can also store a pointer to the array in a hash table with the key
> >> >>>>> being some sort of a hash of the array in order to do some
> >> >>>>> deduplication afterwards (point to the same arrays, but obviously
> >> >>>>> different private data, so that would still be duplicated) if we
> >> >>>>> feel it's needed. It probably won't save as much though and will
> >> >>>>> have some running costs.
> >> >>>>
> >> >>>> For anything < 16 entries, I bet that a hash table will be slower than
> >> >>>> walking an array. Don't forget you need to compute the hash key, jump
> >> >>>> in an array, walk down a rbtree and finally iterate over a list. Hash
> >> >>>> are good for very large number of object, not for small number.
> >> >>>
> >> >>> That was an optimisation that I just threw out there to the world, but
> >> >>> I believe you misunderstood me. I didn't mean we create a hash table
> >> >>> for calling events, it was for saving memory and deduplicating event
> >> >>> callbacks (essentially callback arrays automatically). This is only
> >> >>> done on callback add/del.
> >> >>
> >> >> Indeed I misunderstood your intent. Still, this will increase the cost
> >> >> of insertion for no benefit in my opinion. See below.
> >> >
> >> > Again, this is a side comment, not that important.
> >>
> >> I have discovered that this is an important use case actually. We do
> >> insert and remove callback quite a lot now that animator is an event
> >> on an object. We do already spend nearly as much time doing
> >> efl_event_callback_priority_add (0.90% for 110 000 calls),
> >> efl_event_callback_array_priority_add (0.49% for 110 000 calls),
> >> efl_event_callback_array_del (0.27% for 40 000 calls) and
> >> efl_event_callback_del (0.93% for 110 000 calls). The cost of adding
> >> and destroying events is not negligible. It is, cumulatively,
> >> actually higher than our current cost for calling events.
> >
> > something to optimize. :) having priority doesn't help us a lot here as we
> > have to walk a lot of list nodes and pull in a lot of l1 cache pages to do
> > the insert given priority. if we also sort by event ptr it won't really
> > matter.
> >
> > if we move to an array of event ptrs for events then we likely will pull in
> > all existing event id's in 1 l1 cache line fetch and this will very likely
> > make insert and remove a lot faster.
> 
> Uh ? Removing from an array is costly as you have to deal with holes!
> That's why we rely on lists so much. How do you think we can deal

instead of just babbling on... i did some actual tests. if you dirty all of
l1/l2/l3 cache, arrays come out ahead of linked lists for 8 event callbacks,
which is about the ballpark of our average callback count. it's an isolated
test where i duplicated the event cb add/del code from eo and wrote some
array-equivalent code and timed it. my test uses 10k objects as the testbed
with 5 different event id's being used. when i pollute all caches, arrays are
about 5% faster than the linked-list next-ptr setup just for adding and
deleting callbacks. i didn't check "finding events to call them".

so the theory holds for totally uncached data. the more that is in l1 cache,
the faster the linked list method gets vs arrays.

> faster with array ? Doing an sorted insert on priority add, should
> actually help to make destruction faster... at likely the cost of

actually i just looked at the eo code... it's a lot less efficient than it
should be. it will mark a cb for deletion, then later walk all event cb's and
free the ones with delete_me set. it should only defer like that while walking
events on that object; otherwise it should free the callback immediately.
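
the immediate-vs-deferred delete could be sketched like this (hypothetical
names, assuming a walking counter on the object - not the actual eo code):

```c
#include <stdlib.h>

typedef struct _Cb_Node
{
   struct _Cb_Node *next;
   int              event_id;
   int              delete_me;
} Cb_Node;

typedef struct
{
   Cb_Node *callbacks;
   int      walking;    /* > 0 while events are being dispatched */
   int      need_sweep; /* a sweep is pending once walking drops to 0 */
} Obj;

static void
cb_del(Obj *o, Cb_Node *cb)
{
   if (o->walking > 0)
     {
        cb->delete_me = 1; /* the walker frees it when the dispatch ends */
        o->need_sweep = 1;
        return;
     }
   /* not walking: unlink and free right now, no later full sweep needed */
   Cb_Node **p;
   for (p = &o->callbacks; *p; p = &(*p)->next)
     if (*p == cb) { *p = cb->next; free(cb); return; }
}
```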

if events were sorted by event id it would raise insertion cost, as right now
we only insert based on priority, not event id. in most cases that means a
priority of 0, and since most if not all callbacks are at priority 0,
insertion is just a prepend on the head of the list.

> increasing insertion... Not so sure there is much to be done there. I
> may be wrong, but maybe an easy way to speedup allocation/destruction
> of single callback is to have an allocation cache using Eina_Trash,
> will try that next week. I don't have the benchmark result at hand
> anymore.

right now we calloc/free callback structs. maybe use a mempool?
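
a trash-style allocation cache is just a push-down list reusing the freed
blocks themselves (sketch with made-up names; real Eina_Trash works on the
same principle of storing the next ptr inside the freed block):

```c
#include <stdlib.h>
#include <string.h>

/* minimal trash-style cache: reuse the freed block itself as the link */
typedef struct _Trash { struct _Trash *next; } Trash;

static Trash *cb_trash = NULL;

static void *
cb_calloc(size_t size) /* size must be >= sizeof(Trash) */
{
   if (cb_trash)
     {
        void *p = cb_trash;
        cb_trash = cb_trash->next;
        memset(p, 0, size); /* callers expect calloc() semantics */
        return p;
     }
   return calloc(1, size);
}

static void
cb_release(void *p)
{
   Trash *t = p;
   t->next = cb_trash;
   cb_trash = t;
}
```

this turns the common alloc/free churn of callback structs into a pointer pop
and push, at the cost of the cache holding on to freed memory.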

> > to do this we have to redesign our internal structures. i think with
> > callback arrays we should also have the callback array content duplicated
> > internally in more optimal memory layout thus it no longer NEEDS to be
> > "static const" and stay around for as long as the object does. we will
> > spend some more cycles hashing the cb array and looking up an existing copy
> > and refcounting it. yes this means it will be another l1 cache fetch
> > possible, tho its a shared bit of mem so if use often it may be in l1 cache
> > already unlike the per-object cb arrays.
> 
> Ouch, deduplication will make insertion and removal way more costly!
> Right now, we just prepend in a short list; going with deduplication
> will imply hashing and comparing. This is clearly going to have a
> high cost for all add/del operations. We can't reduce the calls
> registering events on objects, but we can reduce those emitting
> events. That should drive the design decision we are making here.

we probably only use 10 or 20 distinct cb arrays. that means doing a hash and
most likely a memcmp to find the match (likely found first go with a simple 4,
6 or 8 bit hash). the problem is that deletion means matching on the original
ptr, which means we need to keep the original array around anyway... which
means... we may as well use it directly.
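
for the record, such a dedup cache would look roughly like this (a sketch with
hypothetical names; fnv-1a folded down to 4 bits for the bucket):

```c
#include <stdlib.h>
#include <string.h>

#define DEDUP_BUCKETS 16 /* a 4 bit hash is plenty for ~10-20 arrays */

typedef struct _Dedup
{
   struct _Dedup *next;
   const void    *ptr;  /* the caller's array ptr - still the lookup key */
   size_t         size;
   int            refs;
} Dedup;

static Dedup *buckets[DEDUP_BUCKETS];

static unsigned int
dedup_hash(const void *data, size_t size)
{
   /* fnv-1a over the bytes, folded down to DEDUP_BUCKETS */
   const unsigned char *p = data;
   unsigned int h = 2166136261u;
   for (size_t i = 0; i < size; i++) h = (h ^ p[i]) * 16777619u;
   return (h ^ (h >> 16)) & (DEDUP_BUCKETS - 1);
}

/* return the canonical refcounted entry for this array content */
static Dedup *
dedup_ref(const void *data, size_t size)
{
   unsigned int b = dedup_hash(data, size);
   for (Dedup *d = buckets[b]; d; d = d->next)
     if (d->size == size && !memcmp(d->ptr, data, size))
       { d->refs++; return d; }
   Dedup *d = calloc(1, sizeof(Dedup));
   d->ptr = data; d->size = size; d->refs = 1;
   d->next = buckets[b]; buckets[b] = d;
   return d;
}
```

note the entry keeps the caller's ptr as the key, which is exactly the problem:
the original array has to stay alive anyway.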

> <snip>
> 
> >> As for the hash and compare, this is a reference to your previous
> >> comment saying that you could deduplicate the callbacks after they are
> >> inserted. I don't see how you can implement a deduplication without a
> >> hash and compare. And this likely has to be done at insertion time (and
> >> at removal too).
> >
> > the de-dup would likely use a hash to avoid larger more expensive "memcmp's"
> > against several instances. also by duplicating we make this doable via
> > bindings, which it currently is not (except c++). ALSO with a relayout we
> > likely can make insert and deletion faster.
> 
> Uh ? How ? Right now, it is just walking a short list and inserting
> in. There is no hash computation, no memcmp, no memcpy, no additional
> cost. Sure only C++ can easily leverage the array, but I am not sure
> it is impossible to do in JS. I don't know for other binding, but
> clearly this should be investigated as it is way more efficient
> (faster insertion/deletion with an array than anything else, on top of
> the lower memory usage).
> 
> <snip>
> 
> >> > "Bigger fish to fry" - I fried these fish a lot. Maybe there's still
> >> > room for improvement, but if there is, not much. They are just called a
> >> > shitload of times. If I remember correctly, _efl_object_call_end is one
> >> > line: _eo_unref(obj). And that one is essentially if (--(obj->ref) == 0)
> >> > and then just returns in 99.99% of the cases. Not a lot to optimise
> >> > there. :)
> >>
> >> Sure. I offer you a present. 15% speed improvement on resolve, 20%
> >> speed improvement on scope_get, 50% speed improvement on efl_isa. Took
> >> me longer to write this email and efl_event_callback_call is still
> >> below 1%.
> >>
> >> If you have time, I have not worked on optimizing edje_part_recalc, it
> >> is a big contender for improvement ;-)
> >
> > it is. :) though tbh focus more on memory here... man edje is a bloated mess
> > when it comes to memory... :(
> 
> If I had time, I would spend some serious time optimizing edje. I have
> been saying so for at least 5 years... I know :-(

actually some of the optimizations you did, like mempools per group... really
bloated stuff out. :) yes i know, it's an optimization to free all parts for a
group in one go for example (and alloc them at once), but it bloated mem out a
lot: every group that is indexed in an edje file - whether we use it or not -
has mempools allocated AND pointers (8 bytes each on 64bit) to each mempool..
er, per part type... that was a LOT of memory. :)

... actually i noticed an api defect in eo... our actual api is
callback_priority_add() - callback_add() is just a convenience assuming pri
0... but you can't delete a specific callback WITH a specific priority. the
add and del apis don't follow the same design pattern.
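
a symmetric pair would look something like this (hypothetical signatures, not
the current eo api - del matches the exact same tuple the add registered):

```c
#include <stdlib.h>

typedef struct _Cb
{
   struct _Cb *next;
   int         event_id;
   short       priority;
   void      (*func)(void *data);
   void       *data;
} Cb;

typedef struct { Cb *callbacks; } Obj;

static void
callback_priority_add(Obj *o, int event_id, short priority,
                      void (*func)(void *), void *data)
{
   Cb *cb = calloc(1, sizeof(Cb));
   cb->event_id = event_id; cb->priority = priority;
   cb->func = func; cb->data = data;
   cb->next = o->callbacks; o->callbacks = cb;
}

/* del mirrors add: it takes priority too, so two registrations of the
 * same func/data at different priorities can be told apart */
static int
callback_priority_del(Obj *o, int event_id, short priority,
                      void (*func)(void *), void *data)
{
   for (Cb **p = &o->callbacks; *p; p = &(*p)->next)
     if ((*p)->event_id == event_id && (*p)->priority == priority &&
         (*p)->func == func && (*p)->data == data)
       {
          Cb *dead = *p;
          *p = dead->next;
          free(dead);
          return 1;
       }
   return 0; /* nothing registered with that exact tuple */
}
```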

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    ras...@rasterman.com


------------------------------------------------------------------------------
_______________________________________________
enlightenment-devel mailing list
enlightenment-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/enlightenment-devel
