Re: [E-devel] Callback arrays and callback invocation optimisations

2016-08-28 Thread The Rasterman
On Fri, 26 Aug 2016 22:14:11 -0700 Cedric BAIL  said:

> On Fri, Aug 26, 2016 at 5:44 PM, Carsten Haitzler 
> wrote:
> > On Fri, 26 Aug 2016 12:15:57 -0700 Cedric BAIL  said:
> >> On Fri, Aug 26, 2016 at 2:46 AM, Tom Hacohen  wrote:
> >> > On 24/08/16 20:03, Cedric BAIL wrote:
> >> >> On Wed, Aug 24, 2016 at 2:24 AM, Tom Hacohen 
> >> >> wrote:
> >> >>> On 23/08/16 18:51, Cedric BAIL wrote:
> >>  On Tue, Aug 23, 2016 at 3:31 AM, Tom Hacohen 
> >>  wrote:
> 
> 
> 
> >> > We can also store a pointer to the array in a hash table with the key
> >> > being some sort of a hash of the array in order to do some
> >> > deduplication afterwards (point to the same arrays, but obviously
> >> > different private data, so that would still be duplicated) if we
> >> > feel it's needed. It probably won't save as much though and will
> >> > have some running costs.
> >> 
> >>  For anything < 16 entries, I bet that a hash table will be slower than
> >>  walking an array. Don't forget you need to compute the hash key, jump
> >>  in an array, walk down an rbtree and finally iterate over a list. Hashes
> >>  are good for a very large number of objects, not for a small number.
> >> >>>
> >> >>> That was an optimisation that I just threw out there to the world, but
> >> >>> I believe you misunderstood me. I didn't mean we create a hash table
> >> >>> for calling events, it was for saving memory and deduplicating event
> >> >>> callbacks (essentially callback arrays automatically). This is only
> >> >>> done on callback add/del.
> >> >>
> >> >> Indeed I misunderstood your intent. Still this will increase the cost
> >> >> of insertion for no benefit in my opinion. See below.
> >> >
> >> > Again, this is a side comment, not that important.
> >>
> >> I have discovered that this is an important use case actually. We do
> >> insert and remove callbacks quite a lot now that animator is an event
> >> on an object. We do already spend nearly as much time doing
> >> efl_event_callback_priority_add (0.90% for 110 000 calls),
> >> efl_event_callback_array_priority_add (0.49% for 110 000 calls),
> >> efl_event_callback_array_del (0.27% for 40 000 calls) and
> >> efl_event_callback_del (0.93% for 110 000 calls). The cost of adding
> >> and destroying events is not negligible. It is, cumulatively,
> >> actually higher than our current cost for calling events.
> >
> > something to optimize. :) having priority doesn't help us a lot here as we
> > have to walk a lot of list nodes and pull in a lot of l1 cache pages to do
> > the insert given priority. if we also sort by event ptr it won't really
> > matter.
> >
> > if we move to an array of event ptrs for events then we likely will pull in
> > all existing event id's in 1 l1 cache line fetch and this will very likely
> > make insert and remove a lot faster.
> 
> Uh ? Removing from an array is costly as you have to deal with holes !
> That's why we rely on list so much. How do you think we can deal

instead of just babbling on... i did some actual tests. arrays come out ahead
of lists in my test of 8 event callbacks if you dirty all of l1/l2/l3 cache;
8 events is about the ballpark of our average callback count. it's an isolated
test where i duplicated the event cb add/del code from eo, wrote some
array-equivalent code and timed it. my test uses 10k objects as the testbed,
with 5 different event id's being used. when i pollute all caches, arrays are
about 5% faster than the linked-list next-ptr setup just for adding and
deleting callbacks. i didn't check "finding events to call them".

so the theory holds for totally uncached data. the more that is in l1 cache,
the faster the linked-list method gets vs arrays.

> faster with an array ? Doing a sorted insert on priority add should
> actually help to make destruction faster... at likely the cost of

actually i just looked at the eo code... it's a lot less efficient than it
should be. it will mark a cb for deletion, then later walk all event cb's and
free the ones with delete_me. it should only defer like that while walking
events on that object; otherwise it should delete immediately.

if events were ordered by event id it would raise insertion cost: right now
we only insert based on priority, not event id, and in most cases that means a
priority of 0. since most if not all events are priority 0, insertion is just
a prepend on the head of the list.

> increasing insertion... Not so sure there is much to be done there. I
> may be wrong, but maybe an easy way to speedup allocation/destruction
> of single callback is to have an allocation cache using Eina_Trash,
> will try that next week. I don't have the benchmark result at hand
> anymore.

right now we calloc/free callback structs. maybe use a mempool?

> > to do this we have to redesign our internal structures. i think with
> 

Re: [E-devel] Callback arrays and callback invocation optimisations

2016-08-26 Thread Cedric BAIL
On Fri, Aug 26, 2016 at 5:44 PM, Carsten Haitzler  wrote:
> On Fri, 26 Aug 2016 12:15:57 -0700 Cedric BAIL  said:
>> On Fri, Aug 26, 2016 at 2:46 AM, Tom Hacohen  wrote:
>> > On 24/08/16 20:03, Cedric BAIL wrote:
>> >> On Wed, Aug 24, 2016 at 2:24 AM, Tom Hacohen  wrote:
>> >>> On 23/08/16 18:51, Cedric BAIL wrote:
>>  On Tue, Aug 23, 2016 at 3:31 AM, Tom Hacohen  
>>  wrote:



>> > We can also store a pointer to the array in a hash table with the key
>> > being some sort of a hash of the array in order to do some 
>> > deduplication
>> > afterwards (point to the same arrays, but obviously different private
>> > data, so that would still be duplicated) if we feel it's needed. It
>> > probably won't save as much though and will have some running costs.
>> 
>>  For anything < 16 entries, I bet that a hash table will be slower than
>>  walking an array. Don't forget you need to compute the hash key, jump
>>  in an array, walk down an rbtree and finally iterate over a list. Hashes
>>  are good for a very large number of objects, not for a small number.
>> >>>
>> >>> That was an optimisation that I just threw out there to the world, but I
>> >>> believe you misunderstood me. I didn't mean we create a hash table for
>> >>> calling events, it was for saving memory and deduplicating event
>> >>> callbacks (essentially callback arrays automatically). This is only done
>> >>> on callback add/del.
>> >>
>> >> Indeed I misunderstood your intent. Still this will increase the cost
>> >> of insertion for no benefit in my opinion. See below.
>> >
>> > Again, this is a side comment, not that important.
>>
>> I have discovered that this is an important use case actually. We do
>> insert and remove callbacks quite a lot now that animator is an event
>> on an object. We do already spend nearly as much time doing
>> efl_event_callback_priority_add (0.90% for 110 000 calls),
>> efl_event_callback_array_priority_add (0.49% for 110 000 calls),
>> efl_event_callback_array_del (0.27% for 40 000 calls) and
>> efl_event_callback_del (0.93% for 110 000 calls). The cost of adding
>> and destroying events is not negligible. It is, cumulatively,
>> actually higher than our current cost for calling events.
>
> something to optimize. :) having priority doesn't help us a lot here as we 
> have
> to walk a lot of list nodes and pull in a lot of l1 cache pages to do the
> insert given priority. if we also sort by event ptr it won't really matter.
>
> if we move to an array of event ptrs for events then we likely will pull in 
> all
> existing event id's in 1 l1 cache line fetch and this will very likely make
> insert and remove a lot faster.

Uh ? Removing from an array is costly as you have to deal with holes !
That's why we rely on lists so much. How do you think we can deal
faster with an array ? Doing a sorted insert on priority add should
actually help to make destruction faster... at likely the cost of
increasing insertion... Not so sure there is much to be done there. I
may be wrong, but maybe an easy way to speed up allocation/destruction
of a single callback is to have an allocation cache using Eina_Trash;
will try that next week. I don't have the benchmark results at hand
anymore.

> to do this we have to redesign our internal structures. i think with callback
> arrays we should also have the callback array content duplicated internally in
> a more optimal memory layout, thus it no longer NEEDS to be "static const" and
> stay around for as long as the object does. we will spend some more cycles
> hashing the cb array, looking up an existing copy and refcounting it. yes
> this means possibly another l1 cache fetch, tho it's a shared bit of
> mem, so if used often it may be in l1 cache already, unlike the per-object cb
> arrays.

Ouch, deduplication will make insertion and removal way more costly !
Right now we just prepend into a short list; going with deduplication
will imply hashing and comparing. This is clearly going to have a
high cost for every add/del operation. We can't reduce the number of
calls that register events on objects, but we can reduce event
emission. That should drive the design decision we are making here.



>> As for the hash and compare, this is a reference to your previous
>> comment saying that you could deduplicate the callbacks after they are
>> inserted. I don't see how you can implement a deduplication without a
>> hash and compare. And this likely has to be done at insertion time (and
>> at removal too).
>
> the de-dup would likely use a hash to avoid larger more expensive "memcmp's"
> against several instances. also by duplicating we make this doable via
> bindings, which it currently is not (except c++). ALSO with a relayout we can
> likely make insertion and deletion faster.

Uh ? How ? Right now it is just walking a short list and inserting
into it. There is no hash 

Re: [E-devel] Callback arrays and callback invocation optimisations

2016-08-26 Thread The Rasterman
On Fri, 26 Aug 2016 12:15:57 -0700 Cedric BAIL  said:

> On Fri, Aug 26, 2016 at 2:46 AM, Tom Hacohen  wrote:
> > On 24/08/16 20:03, Cedric BAIL wrote:
> >> On Wed, Aug 24, 2016 at 2:24 AM, Tom Hacohen  wrote:
> >>> On 23/08/16 18:51, Cedric BAIL wrote:
>  On Tue, Aug 23, 2016 at 3:31 AM, Tom Hacohen  
>  wrote:
> >>
> >> 
> >>
> > However, while they provide a nice memory improvement, they have been
> > hampering many optimisation strategies that would make callback
> > invocation significantly faster. Furthermore, maybe (not sure), we can
> > automatically de-duplicate event lists internally (more on that in a
> > moment). With that being said, there is a way we can maybe keep array
> > callbacks with some limitations.
> 
>  Do you have a case where performance is impacted by callbacks today ?
>  I have found that we usually have a very small number of callbacks
>  (likely in an array these days) and when speed did really matter it was
>  just best to not trigger the callback at all (that's why we have this
>  code in many places that counts if any callback has been registered).
> >>>
> >>> It always showed up in callgrind. Obviously after you did your changes
> >>> that improved things, because you essentially just don't call that code,
> >>> but having to do this everywhere is a bit of a pain, especially if we
> >>> can just make callbacks fast on their own.
> >>>
> >>> Callback_call takes around 1.5% in the efl atm. Though if we remove the
> >>> not-call optimisations it would be much more again. I wonder if we can
> >>> reach good results without it.
> >>
> >> When genlist is scrolling, just calling a function is costly as we end
> >> up calling it millions of times, literally. I seriously doubt it is
> >> possible.
> >
> > And yet, this is one of the functions that stand out and not others that
> > are "just called".
> 
> Can you share your test case ? I can't reproduce a way where it does
> stand out. It barely registers at 1% in my benchmark. It is maybe in
> position 20 of the most costly function call. After malloc, free,
> mutex_lock/unlock, eina_hash_superfast (I think I can optimize that
> one easily).

speaking of malloc/free overhead:

https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/

scroll to the bottom for some graphs. and the graphs are not the raw
malloc/free performance itself but overall "webserver throughput" as a result
of switching malloc implementations. we might not get as much of a boost...

it smells like it'd be good to have our own allocators for as much memory as we
can and maybe ship jemalloc with eina. no - don't REPLACE libc malloc/free
entirely for a process, but at least use it for our own allocations, where our
external api doesn't require someone to free something with libc free().


> >>>  From my tests back when I was optimising callback invocation, we had
> >>> around 5 callbacks on average on objects with non-zero number of
> >>> registered callbacks with a maximum number of around 12 if my memory
> >>> serves, so this could potentially make callback calls so fast any
> >>> optimisations won't matter.
> >>
> >> These numbers were from before callback arrays. I am seriously
> >> interested to know today's numbers. Also an improved statistic would be
> >> to know how many callbacks are walked over in the most called case and
> >> how many of those callbacks are actually in an array already.
> >>
> >> 
> >
> > Callback array or not, you still end up walking all of the callbacks...
> 
> Sure, but if you mostly have only one sorted callback array, that
> doesn't really matter. You will be faster in the main use case.
> 
> > We can also store a pointer to the array in a hash table with the key
> > being some sort of a hash of the array in order to do some deduplication
> > afterwards (point to the same arrays, but obviously different private
> > data, so that would still be duplicated) if we feel it's needed. It
> > probably won't save as much though and will have some running costs.
> 
>  For anything < 16 entries, I bet that a hash table will be slower than
>  walking an array. Don't forget you need to compute the hash key, jump
>  in an array, walk down an rbtree and finally iterate over a list. Hashes
>  are good for a very large number of objects, not for a small number.
> >>>
> >>> That was an optimisation that I just threw out there to the world, but I
> >>> believe you misunderstood me. I didn't mean we create a hash table for
> >>> calling events, it was for saving memory and deduplicating event
> >>> callbacks (essentially callback arrays automatically). This is only done
> >>> on callback add/del.
> >>
> >> Indeed I misunderstood your intent. Still this will increase the cost
> >> of insertion for no benefit in my opinion. See below.
> >
> > Again, 

Re: [E-devel] Callback arrays and callback invocation optimisations

2016-08-26 Thread Cedric BAIL
On Fri, Aug 26, 2016 at 2:46 AM, Tom Hacohen  wrote:
> On 24/08/16 20:03, Cedric BAIL wrote:
>> On Wed, Aug 24, 2016 at 2:24 AM, Tom Hacohen  wrote:
>>> On 23/08/16 18:51, Cedric BAIL wrote:
 On Tue, Aug 23, 2016 at 3:31 AM, Tom Hacohen  wrote:
>>
>> 
>>
> However, while they provide a nice memory improvement, they have been
> hampering many optimisation strategies that would make callback
> invocation significantly faster. Furthermore, maybe (not sure), we can
> automatically de-duplicate event lists internally (more on that in a
> moment). With that being said, there is a way we can maybe keep array
> callbacks with some limitations.

 Do you have a case where performance is impacted by callbacks today ?
 I have found that we usually have a very small number of callbacks
 (likely in an array these days) and when speed did really matter it was
 just best to not trigger the callback at all (that's why we have this
 code in many places that counts if any callback has been registered).
>>>
>>> It always showed up in callgrind. Obviously after you did your changes
>>> that improved things, because you essentially just don't call that code,
>>> but having to do this everywhere is a bit of a pain, especially if we
>>> can just make callbacks fast on their own.
>>>
>>> Callback_call takes around 1.5% in the efl atm. Though if we remove the
>>> not-call optimisations it would be much more again. I wonder if we can
>>> reach good results without it.
>>
>> When genlist is scrolling, just calling a function is costly as we end
>> up calling it millions of times, literally. I seriously doubt it is
>> possible.
>
> And yet, this is one of the functions that stand out and not others that
> are "just called".

Can you share your test case ? I can't reproduce a way where it does
stand out. It barely registers at 1% in my benchmark. It is maybe in
position 20 of the most costly function calls. After malloc, free,
mutex_lock/unlock, eina_hash_superfast (I think I can optimize that
one easily).

>>>  From my tests back when I was optimising callback invocation, we had
>>> around 5 callbacks on average on objects with non-zero number of
>>> registered callbacks with a maximum number of around 12 if my memory
>>> serves, so this could potentially make callback calls so fast any
>>> optimisations won't matter.
>>
>> These numbers were from before callback arrays. I am seriously
>> interested to know today's numbers. Also an improved statistic would be
>> to know how many callbacks are walked over in the most called case and
>> how many of those callbacks are actually in an array already.
>>
>> 
>
> Callback array or not, you still end up walking all of the callbacks...

Sure, but if you mostly have only one sorted callback array, that
doesn't really matter. You will be faster in the main use case.

> We can also store a pointer to the array in a hash table with the key
> being some sort of a hash of the array in order to do some deduplication
> afterwards (point to the same arrays, but obviously different private
> data, so that would still be duplicated) if we feel it's needed. It
> probably won't save as much though and will have some running costs.

 For anything < 16 entries, I bet that a hash table will be slower than
 walking an array. Don't forget you need to compute the hash key, jump
 in an array, walk down an rbtree and finally iterate over a list. Hashes
 are good for a very large number of objects, not for a small number.
>>>
>>> That was an optimisation that I just threw out there to the world, but I
>>> believe you misunderstood me. I didn't mean we create a hash table for
>>> calling events, it was for saving memory and deduplicating event
>>> callbacks (essentially callback arrays automatically). This is only done
>>> on callback add/del.
>>
>> Indeed I misunderstood your intent. Still this will increase the cost
>> of insertion for no benefit in my opinion. See below.
>
> Again, this is a side comment, not that important.

I have discovered that this is an important use case actually. We do
insert and remove callbacks quite a lot now that animator is an event
on an object. We already spend nearly as much time doing
efl_event_callback_priority_add (0.90% for 110 000 calls),
efl_event_callback_array_priority_add (0.49% for 110 000 calls),
efl_event_callback_array_del (0.27% for 40 000 calls) and
efl_event_callback_del (0.93% for 110 000 calls). The cost of adding
and destroying events is not negligible. It is, cumulatively,
actually higher than our current cost for calling events.

I am not really sure of the exact cost structure, but it seems that
CALLBACK_ADD and CALLBACK_DEL are responsible for a fair amount of
that cost. Maybe optimizing those would be useful. I can see the event
catcher being called around 350 000 times with a total cost around 0.15%,
but that 

Re: [E-devel] Callback arrays and callback invocation optimisations

2016-08-26 Thread Tom Hacohen
On 26/08/16 11:21, David Seikel wrote:
> On Fri, 26 Aug 2016 10:46:48 +0100 Tom Hacohen 
> wrote:
>
>> a shitload of times. If I remember correctly, _efl_object_call_end is
>> one line: _eo_unref(obj). And that one is essentially if
>> (--(obj->ref) == 0) and then just returns in 99.99% of the cases. Not
>> a lot to optimise there. :)
>
> Make it a macro, done, it's optimised.  B-)

It's inlined. IIRC I checked, and the CPU usage in _efl_object_call_end 
is almost only in the function invocation itself!

--
Tom


--
___
enlightenment-devel mailing list
enlightenment-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/enlightenment-devel


Re: [E-devel] Callback arrays and callback invocation optimisations

2016-08-26 Thread David Seikel
On Fri, 26 Aug 2016 10:46:48 +0100 Tom Hacohen 
wrote:

> a shitload of times. If I remember correctly, _efl_object_call_end is
> one line: _eo_unref(obj). And that one is essentially if
> (--(obj->ref) == 0) and then just returns in 99.99% of the cases. Not
> a lot to optimise there. :)

Make it a macro, done, it's optimised.  B-)

-- 
A big old stinking pile of genius that no one wants
coz there are too many silver coated monkeys in the world.




Re: [E-devel] Callback arrays and callback invocation optimisations

2016-08-26 Thread Tom Hacohen
On 24/08/16 20:03, Cedric BAIL wrote:
> On Wed, Aug 24, 2016 at 2:24 AM, Tom Hacohen  wrote:
>> On 23/08/16 18:51, Cedric BAIL wrote:
>>> On Tue, Aug 23, 2016 at 3:31 AM, Tom Hacohen  wrote:
>
> 
>
 However, while they provide a nice memory improvement, they have been
 hampering many optimisation strategies that would make callback
 invocation significantly faster. Furthermore, maybe (not sure), we can
 automatically de-duplicate event lists internally (more on that in a
 moment). With that being said, there is a way we can maybe keep array
 callbacks with some limitations.
>>>
>>> Do you have a case where performance is impacted by callbacks today ?
>>> I have found that we usually have a very small number of callbacks
>>> (likely in an array these days) and when speed did really matter it was
>>> just best to not trigger the callback at all (that's why we have this
>>> code in many places that counts if any callback has been registered).
>>
>> It always showed up in callgrind. Obviously after you did your changes
>> that improved things, because you essentially just don't call that code,
>> but having to do this everywhere is a bit of a pain, especially if we
>> can just make callbacks fast on their own.
>>
>> Callback_call takes around 1.5% in the efl atm. Though if we remove the
>> not-call optimisations it would be much more again. I wonder if we can
>> reach good results without it.
>
> When genlist is scrolling, just calling a function is costly as we end
> up calling it millions of times, literally. I seriously doubt it is
> possible.
>

And yet, this is one of the functions that stand out and not others that 
are "just called".

>>  From my tests back when I was optimising callback invocation, we had
>> around 5 callbacks on average on objects with non-zero number of
>> registered callbacks with a maximum number of around 12 if my memory
>> serves, so this could potentially make callback calls so fast any
>> optimisations won't matter.
>
> These numbers were from before callback arrays. I am seriously
> interested to know today's numbers. Also an improved statistic would be
> to know how many callbacks are walked over in the most called case and
> how many of those callbacks are actually in an array already.
>
> 

Callback array or not, you still end up walking all of the callbacks...

>
 We can also store a pointer to the array in a hash table with the key
 being some sort of a hash of the array in order to do some deduplication
 afterwards (point to the same arrays, but obviously different private
 data, so that would still be duplicated) if we feel it's needed. It
 probably won't save as much though and will have some running costs.
>>>
>>> For anything < 16 entries, I bet that a hash table will be slower than
>>> walking an array. Don't forget you need to compute the hash key, jump
>>> in an array, walk down an rbtree and finally iterate over a list. Hashes
>>> are good for a very large number of objects, not for a small number.
>>
>> That was an optimisation that I just threw out there to the world, but I
>> believe you misunderstood me. I didn't mean we create a hash table for
>> calling events, it was for saving memory and deduplicating event
>> callbacks (essentially callback arrays automatically). This is only done
>> on callback add/del.
>
> Indeed I misunderstood your intent. Still this will increase the cost
> of insertion for no benefit in my opinion. See below.

Again, this is a side comment, not that important.

>
 The last idea is to keep callback arrays, but kind of limit their scope.
 The problem (or at least one of them) is that callback arrays support
 setting a priority which means calling them needs to be in between the
 calls to normal callbacks. This adds a lot of complexity (this is a very
 hot path, even a simple if is complexity, but this adds more). If we
 define that all callback arrays are always the lowest priority (called
 last), which in practice will have almost zero impact if at all, we can
 just keep them, and just call them after we do the normal callback calls
 (if they exist). We can even optimise further by not making the arrays
 constant, and thus letting us sort them and then run the same algorithm
 mentioned above for searching. This is probably the most acceptable
 compromise, though I'm not sure if it'll block any future optimisation
 attempts that I'm not able to foresee.
>>>
>>> No ! Arrays are only useful if they are constant ! That is the only way
>>> to share them across all instances of an object. Their size being
>>> ridiculously small, I bet you won't win anything by reordering them.
>>> And if you really want to reorder them, you can do that once at
>>> creation time in the inline function that creates them, as defined in
>>> Eo.h.
>>
>> That is absolutely untrue. You can reorder them where they are created
>> (like you 

Re: [E-devel] Callback arrays and callback invocation optimisations

2016-08-24 Thread Cedric BAIL
On Wed, Aug 24, 2016 at 2:24 AM, Tom Hacohen  wrote:
> On 23/08/16 18:51, Cedric BAIL wrote:
>> On Tue, Aug 23, 2016 at 3:31 AM, Tom Hacohen  wrote:



>>> However, while they provide a nice memory improvement, they have been
>>> hampering many optimisation strategies that would make callback
>>> invocation significantly faster. Furthermore, maybe (not sure), we can
>>> automatically de-duplicate event lists internally (more on that in a
>>> moment). With that being said, there is a way we can maybe keep array
>>> callbacks with some limitations.
>>
>> Do you have a case where performance is impacted by callbacks today ?
>> I have found that we usually have a very small number of callbacks
>> (likely in an array these days) and when speed did really matter it was
>> just best to not trigger the callback at all (that's why we have this
>> code in many places that counts if any callback has been registered).
>
> It always showed up in callgrind. Obviously after you did your changes
> that improved things, because you essentially just don't call that code,
> but having to do this everywhere is a bit of a pain, especially if we
> can just make callbacks fast on their own.
>
> Callback_call takes around 1.5% in the efl atm. Though if we remove the
> not-call optimisations it would be much more again. I wonder if we can
> reach good results without it.

When genlist is scrolling, just calling a function is costly as we end
up calling it millions of times, literally. I seriously doubt it is
possible.

>  From my tests back when I was optimising callback invocation, we had
> around 5 callbacks on average on objects with non-zero number of
> registered callbacks with a maximum number of around 12 if my memory
> serves, so this could potentially make callback calls so fast any
> optimisations won't matter.

These numbers were from before callback arrays. I am seriously
interested to know today's numbers. Also an improved statistic would be
to know how many callbacks are walked over in the most-called case and
how many of those callbacks are already in an array.



>>> We can also store a pointer to the array in a hash table with the key
>>> being some sort of a hash of the array in order to do some deduplication
>>> afterwards (point to the same arrays, but obviously different private
>>> data, so that would still be duplicated) if we feel it's needed. It
>>> probably won't save as much though and will have some running costs.
>>
>> For anything < 16 entries, I bet that a hash table will be slower than
>> walking an array. Don't forget you need to compute the hash key, jump
>> in an array, walk down an rbtree and finally iterate over a list. Hashes
>> are good for a very large number of objects, not for a small number.
>
> That was an optimisation that I just threw out there to the world, but I
> believe you misunderstood me. I didn't mean we create a hash table for
> calling events, it was for saving memory and deduplicating event
> callbacks (essentially callback arrays automatically). This is only done
> on callback add/del.

Indeed I misunderstood your intent. Still this will increase the cost
of insertion for no benefit in my opinion. See below.

>>> The last idea is to keep callback arrays, but kind of limit their scope.
>>> The problem (or at least one of them) is that callback arrays support
>>> setting a priority which means calling them needs to be in between the
>>> calls to normal callbacks. This adds a lot of complexity (this is a very
>>> hot path, even a simple if is complexity, but this adds more). If we
>>> define that all callback arrays are always the lowest priority (called
>>> last), which in practice will have almost zero impact if at all, we can
>>> just keep them, and just call them after we do the normal callback calls
>>> (if they exist). We can even optimise further by not making the arrays
>>> constant, and thus letting us sort them and then run the same algorithm
>>> mentioned above for searching. This is probably the most acceptable
>>> compromise, though I'm not sure if it'll block any future optimisation
>>> attempts that I'm not able to foresee.
>>
>> No ! Arrays are only useful if they are constant ! That is the only way
>> to share them across all instances of an object. Their size being
>> ridiculously small, I bet you won't win anything by reordering them.
>> And if you really want to reorder them, you can do that once at
>> creation time in the inline function that creates them, as defined in
>> Eo.h.
>
> That is absolutely untrue. You can reorder them where they are created
> (like you suggested), or reorder them when they are added and still
> share them. You'll only need to reorder once, after that, when they are
> in order, that's it. Const doesn't matter or help at all. Obviously
> you're expected not to change them.

If the array is not const, then you have to allocate it every time you
register it. This has a direct cost. Adding the fact 

Re: [E-devel] Callback arrays and callback invocation optimisations

2016-08-24 Thread Tom Hacohen
On 23/08/16 18:51, Cedric BAIL wrote:
> Hello,
>
> On Tue, Aug 23, 2016 at 3:31 AM, Tom Hacohen  wrote:
>> Callback arrays was an idea that was introduced a while back to save
>> memory. The idea came from the observation that in many cases we add a
>> few callbacks together to every object of the same type. For example,
>> for an elm widget, we may add "move, resize, hidden, mouse_down,
>> mouse_up", so instead of manually adding all of them separately and having
>> eo allocate memory for each (and a separate private data pointer for
>> each) we would just create a static array and pass that.
>> This leads (at least in theory) to quite significant savings in memory.
>> I haven't actually tested the change, as I didn't do it, Cedric, do you
>> have some numbers to share?
>
> I remember something along the lines of a few hundred KB on
> elementary_test. Edje using it was a real saver; it was ahead of any
> memory allocation at that point.

That's cool. That's why I suggested keeping them (to an extent) in my 
proposal.

>
>> However, while they provide a nice memory improvement, they have been
>> hampering many optimisation strategies that would make callback
>> invocation significantly faster. Furthermore, maybe (not sure), we can
>> automatically de-duplicate event lists internally (more on that in a
>> moment). With that being said, there is a way we can maybe keep array
>> callbacks with some limitations.
>
> Do you have a case where performance is impacted by callbacks today?
> I have found that we usually have a very small number of callbacks
> (likely in an array these days), and when speed really mattered it was
> just best to not trigger the callback at all (that's why we have this
> code in many places that checks if any callback has been registered).

It always showed up in callgrind. Obviously less so after the changes 
you made that improved things, because you essentially just don't call 
that code, but having to do this everywhere is a bit of a pain, 
especially if we can instead make callbacks fast on their own.

callback_call takes around 1.5% in the EFL at the moment, though if we 
removed the not-call optimisations it would be much more again. I wonder 
if we can reach good results without them.

 From my tests back when I was optimising callback invocation, objects 
with a non-zero number of registered callbacks had around 5 on average, 
with a maximum of around 12 if my memory serves, so this could 
potentially make callback calls so fast that no other optimisations 
would matter.
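The sorted-array scheme discussed in this thread is easy to sketch in plain C. This is a hedged illustration only: the struct and function names below are hypothetical, not the real Eo internals. Entries are sorted by the address of their event descriptor, and the call loop bails out as soon as it walks past the target address.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of the proposal -- none of these names are the
 * real Eo API.  Callbacks live in an array sorted by the address of
 * the event descriptor, so a call walks entries in order and stops as
 * soon as the current descriptor's address passes the target's. */
typedef struct { const char *name; } Event_Desc;

typedef struct
{
   const Event_Desc *desc;    /* sort key: descriptor address */
   void (*func)(void *data);  /* the registered callback */
} Cb_Entry;

/* qsort comparator ordering entries by descriptor address */
static int
_cb_cmp(const void *a, const void *b)
{
   uintptr_t pa = (uintptr_t)((const Cb_Entry *)a)->desc;
   uintptr_t pb = (uintptr_t)((const Cb_Entry *)b)->desc;
   return (pa > pb) - (pa < pb);
}

static void
event_call(const Cb_Entry *cbs, size_t count, const Event_Desc *ev, void *data)
{
   for (size_t i = 0; i < count; i++)
     {
        /* array is sorted by address: nothing past here can match */
        if ((uintptr_t)cbs[i].desc > (uintptr_t)ev) break;
        if (cbs[i].desc == ev) cbs[i].func(data);
     }
}
```

With ~5 callbacks on average the whole array sits in one or two cache lines, and a lookup touches roughly half of it on average; priority would become the secondary sort key within one event's run of entries.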

>
>> I discussed my ideas with Carsten on IRC and I believe we have reached
>> something that would be significantly faster and better.
>>
>> Assuming callback arrays are no more:
>> First of all we will change the internal callbacks structure from a list
>> to an array. The array will be sorted based on the address of the event
>> structure first, and priority second. The array will only include the
>> pointer to the callback structure, nothing else. All of the rest of the
>> information needed will be stored in an array of the same size and the
>> same order.
>> This means that we will now be able to walk the array and stop once the
>> address of the current event is higher than the address of the event we
>> are looking for. Meaning we will walk only 50% of the
>> array on average, and the array itself will most likely be in the same
>> cache line. After we find the correct one, we will also fetch the rest
>> of the data based on the index. This will lead to blazing fast
>> iterations of callbacks without the need to optimise at all.
>
> Callback arrays are already likely to be in the same cache line. Them
> being const means they are automatically shared with all instances. It
> improves not only our memory usage, but also our cache usage. For
> example in edje, we have objects inside an object that trigger a
> callback which gets propagated to their parent. With callback arrays it
> is exactly the same memory used in both objects. So it doesn't only
> save memory; I am pretty sure it results in the same kind of speed up
> you are looking at.

Yes, callback arrays are good like that, that's why I said below that it 
will essentially be the same (and the same code). I have a feeling based 
on this comment and the one above that you read my emails and reply 
immediately instead of reading the whole thing first.

>
>> We can also store a pointer to the array in a hash table with the key
>> being some sort of a hash of the array in order to do some deduplication
>> afterwards (point to the same arrays, but obviously different private
>> data, so that would still be duplicated) if we feel it's needed. It
>> probably won't save as much though and will have some running costs.
>
>> For anything < 16 entries, I bet that a hash table will be slower than
>> walking an array. Don't forget you need to compute the hash key, jump
>> in an array, walk down an rbtree and finally iterate over a list. Hashes
>> are good for a very large number of objects, not for a small number.

Re: [E-devel] Callback arrays and callback invocation optimisations

2016-08-24 Thread The Rasterman
On Tue, 23 Aug 2016 10:51:05 -0700 Cedric BAIL  said:

> Hello,
> 
> On Tue, Aug 23, 2016 at 3:31 AM, Tom Hacohen  wrote:
> > Callback arrays were an idea that was introduced a while back to save
> > memory. The idea came from the observation that in many cases we add a
> > few callbacks together to every object of the same type. For example,
> > for an elm widget, we may add "move, resize, hidden, mouse_down,
> > mouse_up", so instead of manually adding all of them separately and
> > having eo allocate memory for each (and a separate private data pointer
> > for each), we would just create a static array and pass that.
> > This leads (at least in theory) to quite significant savings in memory.
> > I haven't actually tested the change, as I didn't do it, Cedric, do you
> > have some numbers to share?
> 
> I remember something along a few hundred kb on elementary_test. Edje
> using it was a real saver, it was ahead of any memory allocation at
> that point.

that's significant. it's still kind of a problem for api and optimizing call
overhead.

> > However, while they provide a nice memory improvement, they have been
> > hampering many optimisation strategies that would make callback
> > invocation significantly faster. Furthermore, maybe (not sure), we can
> > automatically de-duplicate event lists internally (more on that in a
> > moment). With that being said, there is a way we can maybe keep array
> > callbacks with some limitations.
> 
> Do you have a case where performance are impacted by callback today ?
> I have found that we usually have a very small number of callbacks
> (likely in an array this days) and when speed did really matter it was
> just best to not trigger the callback at all (That's why we have this
> code in many place that count if any callback has been registered).
> 
> > I discussed my ideas with Carsten on IRC and I believe we have reached
> > something that would be significantly faster and better.
> >
> > Assuming callback arrays are no more:
> > First of all we will change the internal callbacks structure from a list
> > to an array. The array will be sorted based on the address of the event
> > structure first, and priority second. The array will only include the
> > pointer to the callback structure, nothing else. All of the rest of the
> > information needed will be stored in an array of the same size and the
> > same order.
> > This means that we will now be able to walk the array and stop once the
> > address of the current event is higher than the address of the event we
> > are looking for. Meaning we will walk only 50% of the
> > array on average, and the array itself will most likely be in the same
> > cache line. After we find the correct one, we will also fetch the rest
> > of the data based on the index. This will lead to blazing fast
> > iterations of callbacks without the need to optimise at all.
> 
> Callback array are already likely to be in the same cache line. Them

well 4 items on 64bit can fit in a cacheline - i think if we restructured it we
could fit 8 items in, so it likely would be more optimal if we redid the data
arrangement. so either we change api and abi OR we make copies of callback
arrays internally and we can hash/memcmp and share them across all objects.
this is also slightly cleaner as the array doesn't have to be kept in memory by
the caller for all objects possibly using it. efl can take care of it. if we
duplicate it internally we can also reformat it internally to be more optimal
like tasn suggests - stuffing cb arrays at the end of regular cb's.

> being const means they are automatically shared with all instances. It
> does impact our memory usage, but also our cache usage. For example in
> edje, we do have object in an object that trigger a callback and get
> propagated to its parent. With array callbacks it is exactly the same
> memory used in both object. So it doesn't only save memory, but I am
> pretty sure it does result in the same kind of speed up you are
> looking at.

yeah. but as above. we can get the same benefits if we de-dup internally and
make copies of the incoming array from the api but reformat it to have better
l1 cache coherency. :)
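A rough sketch of that internal de-dup, under the stated assumption that eo copies incoming arrays; every name here is made up for illustration and none of it is real EFL API. On registration the incoming array is compared (memcmp) against copies already interned, and an identical one is shared and refcounted instead of duplicated.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical internal registry -- not real EFL code.  Registering a
 * callback array interns a private copy; identical arrays (found by
 * memcmp) share one refcounted copy across all objects. */
typedef struct
{
   void  *items;   /* the interned copy of the callback entries */
   size_t size;    /* its size in bytes */
   int    refs;    /* number of objects sharing it */
} Shared_Cb_Array;

#define MAX_SHARED 64
static Shared_Cb_Array _registry[MAX_SHARED];
static int _registry_count = 0;

static Shared_Cb_Array *
cb_array_intern(const void *items, size_t size)
{
   for (int i = 0; i < _registry_count; i++)
     {
        if ((_registry[i].size == size) &&
            (memcmp(_registry[i].items, items, size) == 0))
          {
             _registry[i].refs++;       /* identical array: share it */
             return &_registry[i];
          }
     }
   if (_registry_count == MAX_SHARED) return NULL; /* sketch: fixed cap */
   Shared_Cb_Array *s = &_registry[_registry_count++];
   s->items = malloc(size);
   memcpy(s->items, items, size);       /* first sighting: intern a copy */
   s->size = size;
   s->refs = 1;
   return s;
}
```

Because the copy is private, eo would also be free to repack it for better L1 cache coherency, which is the point made above; a real version would hash before falling back to memcmp, and would handle unref/free.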

> > We can also store a pointer to the array in a hash table with the key
> > being some sort of a hash of the array in order to do some deduplication
> > afterwards (point to the same arrays, but obviously different private
> > data, so that would still be duplicated) if we feel it's needed. It
> > probably won't save as much though and will have some running costs.
> 
> For anything < 16 entries, I bet that a hash table will be slower than
> walking an array. Don't forget you need to compute the hash key, jump
> in an array, walk down an rbtree and finally iterate over a list. Hashes
> are good for a very large number of objects, not for a small number.

no - he's talking about taking a list of callbacks and de-duplicating them into
a single 

Re: [E-devel] Callback arrays and callback invocation optimisations

2016-08-23 Thread Cedric BAIL
Hello,

On Tue, Aug 23, 2016 at 3:31 AM, Tom Hacohen  wrote:
> Callback arrays were an idea that was introduced a while back to save
> memory. The idea came from the observation that in many cases we add a
> few callbacks together to every object of the same type. For example,
> for an elm widget, we may add "move, resize, hidden, mouse_down,
> mouse_up", so instead of manually adding all of them separately and
> having eo allocate memory for each (and a separate private data pointer
> for each), we would just create a static array and pass that.
> This leads (at least in theory) to quite significant savings in memory.
> I haven't actually tested the change, as I didn't do it, Cedric, do you
> have some numbers to share?

I remember something along the lines of a few hundred KB on
elementary_test. Edje using it was a real saver; it was ahead of any
memory allocation at that point.

> However, while they provide a nice memory improvement, they have been
> hampering many optimisation strategies that would make callback
> invocation significantly faster. Furthermore, maybe (not sure), we can
> automatically de-duplicate event lists internally (more on that in a
> moment). With that being said, there is a way we can maybe keep array
> callbacks with some limitations.

Do you have a case where performance is impacted by callbacks today?
I have found that we usually have a very small number of callbacks
(likely in an array these days), and when speed really mattered it was
just best to not trigger the callback at all (that's why we have this
code in many places that checks if any callback has been registered).

> I discussed my ideas with Carsten on IRC and I believe we have reached
> something that would be significantly faster and better.
>
> Assuming callback arrays are no more:
> First of all we will change the internal callbacks structure from a list
> to an array. The array will be sorted based on the address of the event
> structure first, and priority second. The array will only include the
> pointer to the callback structure, nothing else. All of the rest of the
> information needed will be stored in an array of the same size and the
> same order.
> This means that we will now be able to walk the array and stop once the
> address of the current event is higher than the address of the event we
> are looking for. Meaning we will walk only 50% of the
> array on average, and the array itself will most likely be in the same
> cache line. After we find the correct one, we will also fetch the rest
> of the data based on the index. This will lead to blazing fast
> iterations of callbacks without the need to optimise at all.

Callback arrays are already likely to be in the same cache line. Them
being const means they are automatically shared with all instances. It
improves not only our memory usage, but also our cache usage. For
example in edje, we have objects inside an object that trigger a
callback which gets propagated to their parent. With callback arrays it
is exactly the same memory used in both objects. So it doesn't only save
memory; I am pretty sure it results in the same kind of speed up you are
looking at.

> We can also store a pointer to the array in a hash table with the key
> being some sort of a hash of the array in order to do some deduplication
> afterwards (point to the same arrays, but obviously different private
> data, so that would still be duplicated) if we feel it's needed. It
> probably won't save as much though and will have some running costs.

For anything < 16 entries, I bet that a hash table will be slower than
walking an array. Don't forget you need to compute the hash key, jump
in an array, walk down an rbtree and finally iterate over a list. Hashes
are good for a very large number of objects, not for a small number.

> The last idea is to keep callback arrays, but kind of limit their scope.
> The problem (or at least one of them) is that callback arrays support
> setting a priority which means calling them needs to be in between the
> calls to normal callbacks. This adds a lot of complexity (this is a very
> hot path, even a simple if is complexity, but this adds more). If we
> define that all callback arrays are always the lowest priority (called
> last), which in practice will have almost zero impact if at all, we can
> just keep them, and just call them after we do the normal callback calls
> (if they exist). We can even optimise further by not making the arrays
> constant, and thus letting us sort them and then run the same algorithm
> mentioned above for searching. This is probably the most acceptable
> compromise, though I'm not sure if it'll block any future optimisation
> attempts that I'm not able to foresee.

No! Arrays are only useful if they are constant! That is the only way
to share them across all instances of an object. Their size being
ridiculously small, I bet you won't win anything by reordering them.
And if you really want to reorder them, you can do that once at
creation time in the inline function that creates them, as defined in
Eo.h.
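The sharing this reply describes can be shown with a toy sketch; these are illustrative types only, not the actual Eo.h helper. One const array is emitted once (in .rodata) and every instance stores just a pointer to it plus its own private data.

```c
/* Illustrative sketch -- not real Eo code. */
typedef struct { const char *name; } Event_Desc;

typedef struct
{
   const Event_Desc *desc;
   void (*func)(void *data);
} Cb_Item;

static void _moved(void *d)   { (void)d; }
static void _resized(void *d) { (void)d; }

static const Event_Desc EV_MOVE   = {"move"};
static const Event_Desc EV_RESIZE = {"resize"};

/* one const array, shared by every object that registers it */
static const Cb_Item WIDGET_CBS[] =
{
   { &EV_MOVE,   _moved   },
   { &EV_RESIZE, _resized },
};

/* each object stores only a pointer to the shared array plus its own
 * private data -- the entries themselves are never copied */
typedef struct
{
   const Cb_Item *cbs;   /* shared, constant */
   void *cb_data;        /* per-object private data */
} Object;

static void
object_register(Object *o, const Cb_Item *cbs, void *data)
{
   o->cbs = cbs;
   o->cb_data = data;
}
```

This is why constness matters for the memory savings: only the per-object data pointer is duplicated, never the callback entries.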

Re: [E-devel] Callback arrays and callback invocation optimisations

2016-08-23 Thread Daniel Hirt
Hi

On 08/23/2016 03:59 PM, Gustavo Sverzut Barbieri wrote:
> On Tue, Aug 23, 2016 at 7:31 AM, Tom Hacohen  wrote:
>>
>> Hey,
>>
>> Callback arrays were an idea that was introduced a while back to save
>> memory.
> 
> ohhh, I guessed it was to improve usability, saving many "add" calls
> at the end of an "eo_add()"... at least that's how I use them.
> 

Convenience aside, it also saves the storage of the data pointer and
the priority that are common to all the callbacks in the array.

> Take a look in GIT to see if most usages are like mine or what you
> describe. Based on that you can take some proper action.
> 
> My gut feeling is that the cacheline improvement is a big thing, since
> the number of callbacks tends to be small, and if we can keep this
> compact it's a nice improvement to reach the target. Once there, a bit
> more expensive access doesn't matter that much (since it'll be only one).
> 
> hashes or lookup optimizations are usually not worth it for a small
> number of entries... the cost of preparing the lookup usually doesn't
> pay off; not even a bsearch() usually pays...
> 
> 

--
___
enlightenment-devel mailing list
enlightenment-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/enlightenment-devel


Re: [E-devel] Callback arrays and callback invocation optimisations

2016-08-23 Thread Gustavo Sverzut Barbieri
On Tue, Aug 23, 2016 at 7:31 AM, Tom Hacohen  wrote:
>
> Hey,
>
> Callback arrays were an idea that was introduced a while back to save
> memory.

ohhh, I guessed it was to improve usability, saving many "add" calls
at the end of an "eo_add()"... at least that's how I use them.

Take a look in GIT to see if most usages are like mine or what you
describe. Based on that you can take some proper action.

My gut feeling is that the cacheline improvement is a big thing, since
the number of callbacks tends to be small, and if we can keep this
compact it's a nice improvement to reach the target. Once there, a bit
more expensive access doesn't matter that much (since it'll be only one).

hashes or lookup optimizations are usually not worth it for a small
number of entries... the cost of preparing the lookup usually doesn't
pay off; not even a bsearch() usually pays...


-- 
Gustavo Sverzut Barbieri
--
Mobile: +55 (16) 99354-9890
