On Fri, 26 Aug 2016 12:15:57 -0700 Cedric BAIL <cedric.b...@free.fr> said:

> On Fri, Aug 26, 2016 at 2:46 AM, Tom Hacohen <t...@osg.samsung.com> wrote:
> > On 24/08/16 20:03, Cedric BAIL wrote:
> >> On Wed, Aug 24, 2016 at 2:24 AM, Tom Hacohen <t...@osg.samsung.com> wrote:
> >>> On 23/08/16 18:51, Cedric BAIL wrote:
> >>>> On Tue, Aug 23, 2016 at 3:31 AM, Tom Hacohen <t...@osg.samsung.com> 
> >>>> wrote:
> >>
> >> <snip>
> >>
> >>>>> However, while they provide a nice memory improvement, they have been
> >>>>> hampering many optimisation strategies that would make callback
> >>>>> invocation significantly faster. Furthermore, maybe (not sure), we can
> >>>>> automatically de-duplicate event lists internally (more on that in a
> >>>>> moment). With that being said, there is a way we can maybe keep array
> >>>>> callbacks with some limitations.
> >>>>
> >>>> Do you have a case where performance is impacted by callbacks today ?
> >>>> I have found that we usually have a very small number of callbacks
> >>>> (likely in an array these days) and when speed really mattered it was
> >>>> just best to not trigger the callback at all (that's why we have this
> >>>> code in many places that counts whether any callback has been registered).
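the "count listeners and don't even call" pattern mentioned above, as a minimal
sketch (all names and globals here are made up for illustration, not actual efl
code):

```c
/* illustrative only: a per-event listener count maintained on add/del,
 * so the emit path can bail out with a single compare when nothing is
 * registered, instead of entering the event machinery at all. */
static int _listeners = 0;  /* bumped by add, dropped by del */
static int _dispatched = 0; /* stands in for real callback dispatch */

static void listener_add(void) { _listeners++; }
static void listener_del(void) { _listeners--; }

static void
emit(void)
{
   if (_listeners == 0) return; /* the hot-path "don't call" check */
   _dispatched++;               /* real dispatch would walk callbacks here */
}
```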
> >>>
> >>> It always showed up in callgrind. Obviously after you did your changes
> >>> that improved things, because you essentially just don't call that code,
> >>> but having to do this everywhere is a bit of a pain, especially if we
> >>> can just make callbacks fast on their own.
> >>>
> >>> Callback_call takes around 1.5% in the efl atm. Though if we remove the
> >>> not-call optimisations it would be much more again. I wonder if we can
> >>> reach good results without it.
> >>
> >> When genlist is scrolling, just calling a function is costly as we end
> >> up calling it millions of times, literally. I seriously doubt it is
> >> possible.
> >
> > And yet, this is one of the functions that stand out and not others that
> > are "just called".
> 
> Can you share your test case ? I can't reproduce a way where it does
> stand out. It barely registers at 1% in my benchmark. It is maybe in
> position 20 of the most costly function calls, after malloc, free,
> mutex_lock/unlock and eina_hash_superfast (I think I can optimize that
> one easily).

speaking of malloc/free overhead:

https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/

scroll to the bottom for some graphs. and the graphs are not the raw
malloc/free performance itself but overall "webserver throughput" as a result
of switching malloc implementations. we might not get as much of a boost...

it smells like it'd be good to have our own allocators for as much memory as we
can and maybe ship jemalloc with eina. no - don't REPLACE libc malloc/free
entirely for a process, but at least our own usage of it where our external
api doesn't require someone free something with libc free().
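a minimal sketch of such a layer, with hypothetical eina_mem_* names (not real
eina api): one indirection for efl's own allocations, so the backend could be
pointed at jemalloc's prefixed symbols (je_malloc/je_free, available when
jemalloc is configured with --with-jemalloc-prefix=je_) without touching libc
malloc for the rest of the process:

```c
#include <stdlib.h>

/* hypothetical internal allocation layer. memory that our public api
 * requires users to release with libc free() would bypass this and
 * keep using plain malloc(). */
typedef struct
{
   void *(*alloc)(size_t size);
   void  (*dealloc)(void *ptr);
} Eina_Mem_Backend;

/* default backend is libc; a jemalloc-enabled build would install
 * { je_malloc, je_free } here at init time instead. */
static Eina_Mem_Backend _backend = { malloc, free };

void *
eina_mem_alloc(size_t size)
{
   return _backend.alloc(size);
}

void
eina_mem_free(void *ptr)
{
   _backend.dealloc(ptr);
}
```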


> >>>  From my tests back when I was optimising callback invocation, we had
> >>> around 5 callbacks on average on objects with a non-zero number of
> >>> registered callbacks, with a maximum of around 12 if my memory
> >>> serves, so this could potentially make callback calls so fast any
> >>> optimisations won't matter.
> >>
> >> These numbers were from before callback arrays. I am seriously
> >> interested to know today's numbers. Also an improved statistic would be
> >> to know how many callbacks are walked over in the most called case and
> >> how many of those callbacks are actually in an array already.
> >>
> >> <snip>
> >
> > Callback array or not, you still end up walking all of the callbacks...
> 
> Sure, but if you mostly have only one sorted callback array, that
> doesn't really matter. You will be faster in the main use case.
> 
> >>>>> We can also store a pointer to the array in a hash table with the key
> >>>>> being some sort of a hash of the array in order to do some deduplication
> >>>>> afterwards (point to the same arrays, but obviously different private
> >>>>> data, so that would still be duplicated) if we feel it's needed. It
> >>>>> probably won't save as much though and will have some running costs.
> >>>>
> >>>> For anything < 16 entries, I bet that a hash table will be slower than
> >>>> walking an array. Don't forget you need to compute the hash key, jump
> >>>> in an array, walk down an rbtree and finally iterate over a list. Hashes
> >>>> are good for a very large number of objects, not for small numbers.
> >>>
> >>> That was an optimisation that I just threw out there to the world, but I
> >>> believe you misunderstood me. I didn't mean we create a hash table for
> >>> calling events, it was for saving memory and deduplicating event
> >>> callbacks (essentially callback arrays automatically). This is only done
> >>> on callback add/del.
> >>
> >> Indeed I misunderstood your intent. Still, this will increase the cost
> >> of insertion for no benefit in my opinion. See below.
> >
> > Again, this is a side comment, not that important.
> 
> I have discovered that this is an important use case actually. We do
> insert and remove callbacks quite a lot now that animator is an event
> on an object. We already spend nearly as much time doing
> efl_event_callback_priority_add (0.90% for 110 000 calls),
> efl_event_callback_array_priority_add (0.49% for 110 000 calls),
> efl_event_callback_array_del (0.27% for 40 000 calls) and
> efl_event_callback_del (0.93% for 110 000 calls). The cost of adding
> and destroying events is not negligible. It is, cumulatively,
> actually higher than our current cost for calling events.

something to optimize. :) having priority doesn't help us a lot here as we have
to walk a lot of list nodes and pull in a lot of l1 cache lines to do the
insert given priority. if we also sort by event ptr it won't really matter.

if we move to an array of event ptrs for events then we likely will pull in all
existing event ids in 1 l1 cache line fetch and this will very likely make
insert and remove a lot faster.
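roughly what i mean, with stand-in types (a sketch, not the current eo
internals): event ptrs kept in one sorted contiguous run, binary searched for
the insert slot:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { int id; } Efl_Event_Description; /* stand-in type */

/* all registered event ptrs for an object, sorted by pointer value,
 * in one contiguous block - scanning it touches very few cache lines */
typedef struct
{
   const Efl_Event_Description **events;
   size_t count;
} Cb_Set;

/* binary search: index of the first slot whose event ptr is >= ev,
 * i.e. where ev would be memmove-inserted to keep the array sorted */
static size_t
cb_set_lower_bound(const Cb_Set *set, const Efl_Event_Description *ev)
{
   size_t lo = 0, hi = set->count;
   uintptr_t key = (uintptr_t)ev;

   while (lo < hi)
     {
        size_t mid = lo + (hi - lo) / 2;
        if ((uintptr_t)set->events[mid] < key) lo = mid + 1;
        else hi = mid;
     }
   return lo;
}
```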

to do this we have to redesign our internal structures. i think with callback
arrays we should also have the callback array content duplicated internally in
a more optimal memory layout, thus it no longer NEEDS to be "static const" and
stay around for as long as the object does. we will spend some more cycles
hashing the cb array, looking up an existing copy and refcounting it. yes
this possibly means another l1 cache fetch, tho it's a shared bit of
mem so if used often it may be in l1 cache already, unlike the per-object cb
arrays.
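a toy version of that hash-and-refcount lookup (made-up names, a linked list
instead of a real hash table, no unref or error handling - just to show the
shape of it):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { const void *desc; void (*func)(void *data); } Cb_Item;

/* one shared, refcounted internal copy of a registered cb array */
typedef struct _Shared_Cb_Array
{
   uint64_t hash;   /* cheap first-pass filter before memcmp */
   size_t   count;
   int      refs;
   Cb_Item *items;
   struct _Shared_Cb_Array *next;
} Shared_Cb_Array;

static Shared_Cb_Array *_registry = NULL; /* toy global registry */

static uint64_t
_hash_bytes(const void *data, size_t len)
{
   /* FNV-1a: avoids a full memcmp against every existing entry */
   const unsigned char *p = data;
   uint64_t h = 1469598103934665603ULL;
   for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ULL; }
   return h;
}

/* return an existing shared copy (refcount bumped) or duplicate once */
Shared_Cb_Array *
shared_cb_array_ref(const Cb_Item *items, size_t count)
{
   size_t bytes = count * sizeof(*items);
   uint64_t h = _hash_bytes(items, bytes);
   Shared_Cb_Array *s;

   for (s = _registry; s; s = s->next)
     if (s->hash == h && s->count == count && !memcmp(s->items, items, bytes))
       { s->refs++; return s; }

   s = malloc(sizeof(*s));
   s->hash = h; s->count = count; s->refs = 1;
   s->items = malloc(bytes);
   memcpy(s->items, items, bytes);
   s->next = _registry; _registry = s;
   return s;
}
```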

> I am not really sure of the exact cost structure, but it seems that
> the CALLBACK_ADD and CALLBACK_DEL are responsible for a fair amount of
> that cost. Maybe optimizing those would be useful. I can see the event
> catcher being called around 350 000 times with a total cost around 0.15%,
> but that seems too low to explain all the overall cost. So, not sure as
> I said; needs more investigation.
> 
> >>>>> The last idea is to keep callback arrays, but kind of limit their scope.
> >>>>> The problem (or at least one of them) is that callback arrays support
> >>>>> setting a priority which means calling them needs to be in between the
> >>>>> calls to normal callbacks. This adds a lot of complexity (this is a very
> >>>>> hot path, even a simple if is complexity, but this adds more). If we
> >>>>> define that all callback arrays are always the lowest priority (called
> >>>>> last), which in practice will have almost zero impact if at all, we can
> >>>>> just keep them, and just call them after we do the normal callback calls
> >>>>> (if they exist). We can even optimise further by not making the arrays
> >>>>> constant, and thus letting us sort them and then run the same algorithm
> >>>>> mentioned above for searching. This is probably the most acceptable
> >>>>> compromise, though I'm not sure if it'll block any future optimisation
> >>>>> attempts that I'm not able to foresee.
> >>>>
> >>>> No ! Arrays are only useful if they are constant ! That is the only way
> >>>> to share them across all instances of an object. Their size being
> >>>> ridiculously small, I bet you won't win anything by reordering them.
> >>>> And if you really want to reorder them, you can do that once at
> >>>> creation time in the inline function that creates them as defined in
> >>>> Eo.h.
> >>>
> >>> That is absolutely untrue. You can reorder them where they are created
> >>> (like you suggested), or reorder them when they are added and still
> >>> share them. You'll only need to reorder once, after that, when they are
> >>> in order, that's it. Const doesn't matter or help at all. Obviously
> >>> you're expected not to change them.
> >>
> >> If the array is not const, then you have to allocate it every time you
> >> register it. This has a direct cost. Add the fact that you then have to
> >> sort it, hash it, compare it and maybe free it. I seriously doubt the
> >> wisdom of doing so.
> >>
> >> As said above, sort it at creation, add debug code that will warn when
> >> inserting an unsorted array (code that will be disabled in production)
> >> and just improve walking over those sorted arrays. I bet that will be
> >> enough of a speedup for our real use case, if there is any (see below).
> >
> > Either you are missing something or I'm missing something. First of all,
> > yes, better to sort on creation, we agree on that.
> >
> > Const or not, in both cases it's going to be a static array, so
> > allocated once. I don't see const would change that. I also don't
> > understand what you mean by hash and compare. I think you are confusing
> > my previous optimisation suggestion (please strike it out of your
> > memory), that has *nothing* to do with hashing callback arrays, at least
> > not in this case.
> 
> Hum, if it is const, it means the callee is not going to modify it and
> the caller doesn't give up ownership of that memory. My understanding
> is that when you remove const, you are going to take over the data.

actually you may take it over, or may modify what it points to... it's not
totally clear. it could mean nothing and someone is being lazy :)

> Now, this could be considered a special API and it could be considered
> ok behavior (not convinced) to do so and reuse the modified array
> across multiple calls. Still, that would be confusing, and why would you
> touch that array every next time you register it ? Once seems enough.

my take is we should duplicate the array internally ONCE and re-use the same
internal copy multiple times. that will allow us at any time to re-arrange
layout and content to make it more optimal (eg separate the event ids from the
func ptrs so we get the event ids contiguous in memory for searching)
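the split layout sketched with hypothetical types: event ptrs in one contiguous
run that the search alone touches, func/data pairs in a parallel array only
read on a hit:

```c
#include <stddef.h>

/* "structure of arrays" sketch, not the current eo layout */
typedef struct { void (*func)(void *data); void *data; } Cb_Slot;

typedef struct
{
   const void **events; /* contiguous: cache-friendly to scan */
   Cb_Slot     *slots;  /* parallel array, same indices */
   size_t       count;
} Cb_Table;

/* return the slot index for ev, or count if not registered;
 * only events[] is walked, so the search stays in few cache lines */
static size_t
cb_table_find(const Cb_Table *t, const void *ev)
{
   for (size_t i = 0; i < t->count; i++)
     if (t->events[i] == ev) return i;
   return t->count;
}
```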

> As for the hash and compare, this is a reference to your previous
> comment saying that you could deduplicate the callbacks after they are
> inserted. I don't see how you can implement deduplication without a
> hash and compare. And this likely has to be done at insertion time (and
> at removal too).

the de-dup would likely use a hash to avoid larger, more expensive memcmp's
against several instances. also, by duplicating we make this doable via
bindings, which currently is not possible (except c++). ALSO with a relayout we
likely can make insert and deletion faster.

the current api for eo_event_callback_array_del() tho relies on having the
exact same ptr to the same src cb array memory to delete it. that puts a
spanner in the works. it means you have to keep the original memory for the
array around at all times until deletion. i can't right now think of a way
around this.

> >>>>> I'm not a huge fan of callback arrays, but if they do save the memory
> >>>>> they claim to be saving, I see no problem with keeping a more limited
> >>>>> version of them that let us optimise everything in the manner described
> >>>>> above.
> >>>>
> >>>> I am not a huge fan of optimization without a clear real life case.
> >>>> Please share numbers and scenarios of when it does matter. I have seen
> >>>> enough people wasting their time optimizing things that don't matter
> >>>> that I really take it with a grain of salt if you are not showing a real
> >>>> life scenario. Sharing a callgrind trace or something along that line
> >>>> would really help make your point here.
> >>>
> >>> As I said, it's ~1.5% of the efl cpu usage when scrolling around
> >>> genlist. It also wastes our memory to have them support priority. And as
> >>> your changes proved, there is a reason to minimise callback calls, so we
> >>> already have a case, instead of letting everyone reimplement that
> >>> counting, it's better to just make callback calls fast. As I said, the
> >>> price is very small, all I'm asking for is removing priority from
> >>> callback arrays and always assume they are the lowest priority.
> >>
> >> You realize that as an optimization, you are fighting not calling a
> >> function, not walking an array, not doing fetch and compare (even doing a
> >> dichotomic search). Pretty sure the benefit of not triggering the
> >> event will remain. Oh, and there are plenty of cases where, well, you
> >> will still do the optional propagation, like for animator.
> >>
> >> As for benchmarking, I did a quick run of 'ELM_TEST_AUTOBOUNCE=300
> >> valgrind --tool=callgrind elementary_test -to genlist'. I see 0.90%
> >> of the time spent in efl_event_callback_call (~400 000 calls) and
> >> 0.35% in evas_object_event_callback_call (~500 000 calls). It is going to
> >> be very very hard to win anything on that.
> >>
> >> I also see way bigger fish to fry :
> >>  - _efl_object_call_resolve 12.53%
> >>  - efl_data_scope_get 7.75%
> >>  - efl_isa 3.54%
> >>  - _efl_object_call_end 2.26%
> >>
> >> If you manage to win 10% on any of those, you will have managed more
> >> than if you reduce the cost of calling efl_event_callback_call to 0. I
> >> am really not convinced that you are focusing on the right problem at
> >> all here.
> >
> > As I said, the statement you are making now is not entirely fair. You
> > are essentially saying:
> > I found a very slow function that was showing up in our benchmarks. I
> > stopped calling it in a few cases, and now it doesn't show anymore, so
> > no need to optimise it.
> >
> > However, what happens when we use it again? Are we going to have to
> > chase it all around and block calls like you did? Isn't it better to
> > just make it "good enough for most cases" from the get go so next time
> > someone uses it a lot it doesn't show up in our benchmarks?
> 
> I bet we are. We are already moving to create Eo objects for event
> info. eo_add/del is insanely costly compared to callback_call. We are
> going to avoid as much as we can firing events with complex structures.
> So yes, I do believe that you are trying to optimize the wrong thing.
> Sort the array at creation, check that it is sorted at insertion time,
> do a fast walk on arrays, and be done with it. Seriously, any more than
> that is likely going to have more side effects and more performance and
> memory impact than you think, with absolutely no visible gain.
> 
> > "Bigger fish to fry" - I fried these fish a lot. Maybe there's still
> > room for improvement, but if there is, not much. They are just called a
> > shitload of times. If I remember correctly, _efl_object_call_end is one
> > line: _eo_unref(obj). And that one is essentially if (--(obj->ref) == 0)
> > and then just returns in 99.99% of the cases. Not a lot to optimise
> > there. :)
> 
> Sure. I offer you a present. 15% speed improvement on resolve, 20%
> speed improvement on scope_get, 50% speed improvement on efl_isa. Took
> me longer to write this email and efl_event_callback_call is still
> below 1%.
> 
> If you have time, I have not worked on optimizing edje_part_recalc, it
> is a big contender for improvement ;-)

it is. :) though tbh focus more on memory here... man edje is a bloated mess
when it comes to memory... :(

> Have fun.
> -- 
> Cedric BAIL
> 
> ------------------------------------------------------------------------------
> _______________________________________________
> enlightenment-devel mailing list
> enlightenment-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/enlightenment-devel
> 


-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    ras...@rasterman.com

