On Fri, Aug 26, 2016 at 2:46 AM, Tom Hacohen <t...@osg.samsung.com> wrote:
> On 24/08/16 20:03, Cedric BAIL wrote:
>> On Wed, Aug 24, 2016 at 2:24 AM, Tom Hacohen <t...@osg.samsung.com> wrote:
>>> On 23/08/16 18:51, Cedric BAIL wrote:
>>>> On Tue, Aug 23, 2016 at 3:31 AM, Tom Hacohen <t...@osg.samsung.com> wrote:
>>
>> <snip>
>>
>>>>> However, while they provide a nice memory improvement, they have been
>>>>> hampering many optimisation strategies that would make callback
>>>>> invocation significantly faster. Furthermore, maybe (not sure), we can
>>>>> automatically de-duplicate event lists internally (more on that in a
>>>>> moment). With that being said, there is a way we can maybe keep array
>>>>> callbacks with some limitations.
>>>>
>>>> Do you have a case where performance is impacted by callbacks today ?
>>>> I have found that we usually have a very small number of callbacks
>>>> (likely in an array these days), and when speed really mattered it was
>>>> just best to not trigger the callback at all (that's why we have this
>>>> code in many places that counts whether any callback has been registered).
>>>
>>> It always showed up in callgrind. Obviously your changes improved
>>> things, because you essentially just don't call that code any more, but
>>> having to do that everywhere is a bit of a pain, especially if we can
>>> just make callbacks fast on their own.
>>>
>>> Callback_call takes around 1.5% in the efl atm, though if we removed
>>> the not-call optimisations it would be much more again. I wonder if we
>>> can reach good results without them.
>>
>> When genlist is scrolling, just calling a function is costly as we end
>> up calling it millions of times, literally. I seriously doubt it is
>> possible.
>
> And yet, this is one of the functions that stand out and not others that
> are "just called".

Can you share your test case ? I can't reproduce a scenario where it
stands out. It barely registers at 1% in my benchmark; it is maybe in
position 20 of the most costly function calls, after malloc, free,
mutex_lock/unlock and eina_hash_superfast (I think I can optimize that
one easily).
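
For reference, the not-call optimisation we keep coming back to is
nothing more than a cheap guard before the emit. A minimal sketch, with
made-up names (Emitter, emit_event) rather than the actual EFL internals:

/* Sketch of the "don't call if nobody listens" guard. Names are
 * hypothetical, not the EFL API. */
#include <stddef.h>

typedef void (*Event_Cb)(void *data);

typedef struct
{
   Event_Cb cb;
   void    *data;
} Listener;

typedef struct
{
   Listener *listeners;
   size_t    count;    /* how many callbacks are registered */
} Emitter;

static void
emit_event(Emitter *e)
{
   /* The whole optimisation: skip the call machinery entirely
    * when nothing was ever registered for this event. */
   if (e->count == 0) return;

   for (size_t i = 0; i < e->count; i++)
     e->listeners[i].cb(e->listeners[i].data);
}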

>>>  From my tests back when I was optimising callback invocation, we had
>>> around 5 callbacks on average on objects with a non-zero number of
>>> registered callbacks, with a maximum of around 12 if my memory
>>> serves, so this could potentially make callback calls so fast that
>>> such optimisations won't matter.
>>
>> Those numbers were from before callback arrays. I am seriously
>> interested to know today's numbers. Also, an improved statistic would be
>> to know how many callbacks are walked over in the most-called case and
>> how many of those callbacks are already in an array.
>>
>> <snip>
>
> Callback array or not, you still end up walking all of the callbacks...

Sure, but if you mostly have just one sorted callback array, that
doesn't really matter. You will be faster in the main use case.
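
To make that concrete, here is roughly the fast walk I have in mind:
with the array sorted by event-descriptor address, you can stop at the
first entry past the one you are looking for. Types and names are
illustrative stand-ins, not the real Eo structures:

/* Sketch: dispatching an event against a callback array sorted by
 * event-descriptor address. Illustrative types, not the Eo internals. */
#include <stdint.h>

typedef struct { const char *name; } Event_Desc;
typedef void (*Event_Cb)(void *data, const Event_Desc *ev);

typedef struct
{
   const Event_Desc *desc;   /* sort key */
   Event_Cb          func;
} Cb_Item;                   /* sorted by desc, NULL-desc terminated */

static void
call_sorted(const Cb_Item *arr, const Event_Desc *ev, void *data)
{
   for (const Cb_Item *it = arr; it->desc; it++)
     {
        /* sorted: once past ev's address, no match can follow */
        if ((uintptr_t)it->desc > (uintptr_t)ev) break;
        if (it->desc == ev) it->func(data, ev);
     }
}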

>>>>> We can also store a pointer to the array in a hash table with the key
>>>>> being some sort of a hash of the array in order to do some deduplication
>>>>> afterwards (objects would point to the same arrays, but obviously have
>>>>> different private data, so that part would still be duplicated) if we
>>>>> feel it's needed. It probably won't save as much though and will have
>>>>> some running costs.
>>>>
>>>> For anything < 16 entries, I bet that a hash table will be slower than
>>>> walking an array. Don't forget you need to compute the hash key, jump
>>>> into an array, walk down an rbtree and finally iterate over a list. Hashes
>>>> are good for a very large number of objects, not for a small number.
>>>
>>> That was an optimisation that I just threw out there to the world, but I
>>> believe you misunderstood me. I didn't mean we create a hash table for
>>> calling events; it was for saving memory and deduplicating event
>>> callbacks (essentially creating callback arrays automatically). This is
>>> only done on callback add/del.
>>
>> Indeed I misunderstood your intent. Still, this will increase the cost
>> of insertion for no benefit in my opinion. See below.
>
> Again, this is a side comment, not that important.

I have discovered that this is an important use case actually. We
insert and remove callbacks quite a lot now that animator is an event
on an object. We already spend a fair amount of time in
efl_event_callback_priority_add (0.90% for 110 000 calls),
efl_event_callback_array_priority_add (0.49% for 110 000 calls),
efl_event_callback_array_del (0.27% for 40 000 calls) and
efl_event_callback_del (0.93% for 110 000 calls). The cost of adding
and removing callbacks is not negligible. It is, cumulatively,
actually higher than our current cost for calling events.

I am not really sure of the exact cost structure, but it seems that
CALLBACK_ADD and CALLBACK_DEL are responsible for a fair amount of
that cost. Maybe optimizing those would be useful. I can see the event
catcher being called around 350 000 times with a total cost of around
0.15%, but that seems too low to explain the overall cost. So, as I
said, I am not sure; this needs more investigation.
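
For context on why add is not free: a priority add is an ordered
insert, so it has to walk the existing callbacks to find its slot, on
top of the allocation itself. Roughly, and purely as an illustration,
not the actual implementation:

/* Sketch: the cost hiding inside callback_priority_add - an ordered
 * insert into the per-object callback list. Illustrative only. */
typedef struct _Cb Cb;
struct _Cb
{
   Cb  *next;
   int  priority;   /* lower value == called earlier */
   /* ... callback pointer and data would live here ... */
};

static void
priority_insert(Cb **head, Cb *item)
{
   Cb **slot = head;
   /* Walk until the first entry with a higher priority value: with N
    * registered callbacks this is O(N) per add, plus the malloc. */
   while (*slot && (*slot)->priority <= item->priority)
     slot = &(*slot)->next;
   item->next = *slot;
   *slot = item;
}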

>>>>> The last idea is to keep callback arrays, but kind of limit their scope.
>>>>> The problem (or at least one of them) is that callback arrays support
>>>>> setting a priority which means calling them needs to be in between the
>>>>> calls to normal callbacks. This adds a lot of complexity (this is a very
>>>>> hot path, even a simple if is complexity, but this adds more). If we
>>>>> define that all callback arrays are always the lowest priority (called
>>>>> last), which in practice will have almost zero impact if at all, we can
>>>>> just keep them, and just call them after we do the normal callback calls
>>>>> (if they exist). We can even optimise further by not making the arrays
>>>>> constant, and thus letting us sort them and then run the same algorithm
>>>>> mentioned above for searching. This is probably the most acceptable
>>>>> compromise, though I'm not sure if it'll block any future optimisation
>>>>> attempts that I'm not able to foresee.
>>>>
>>>> No ! Arrays are only useful if they are constant ! That is the only way
>>>> to share them across all instances of an object. Their size being
>>>> ridiculously small, I bet you won't win anything by reordering them.
>>>> And if you really want to reorder them, you can do that once at
>>>> creation time in the inline function that creates them, as defined in
>>>> Eo.h.
>>>
>>> That is absolutely untrue. You can reorder them where they are created
>>> (like you suggested), or reorder them when they are added and still
>>> share them. You'll only need to reorder once; after that, when they are
>>> in order, that's it. Const doesn't matter or help at all. Obviously
>>> you're expected not to change them.
>>
>> If the array is not const, then you have to allocate it every time you
>> register it. This has a direct cost. Add the fact that you then have to
>> sort it, hash it, compare it and maybe free it, and I seriously doubt
>> the wisdom of doing so.
>>
>> As said above, sort it at creation, add debug code that will warn when
>> an unsorted array is inserted (code that will be disabled in production)
>> and just improve walking over those sorted arrays. I bet that will be
>> enough of a speedup for our real use cases, if there are any (see below).
>
> Either you are missing something or I'm missing something. First of all,
> yes, better to sort on creation, we agree on that.
>
> Const or not, in both cases it's going to be a static array, so
> allocated once. I don't see how const would change that. I also don't
> understand what you mean by hash and compare. I think you are confusing
> this with my previous optimisation suggestion (please strike it from
> your memory); that has *nothing* to do with hashing callback arrays, at
> least not in this case.

Hum, if it is const, it means the callee is not going to modify it and
the caller doesn't give up ownership of that memory. My understanding
is that when you remove const, you are taking over the data. Now, this
could be considered a special API, and it could be considered OK
behavior (I am not convinced) to do so and reuse the modified array
across multiple calls. Still, that would be confusing, and why would
you touch that array every time you register it ? Once seems enough.

As for the hash and compare, this is a reference to your previous
comment saying that you could deduplicate the callbacks after they are
inserted. I don't see how you can implement deduplication without a
hash and a compare. And this likely has to be done at insertion time
(and at removal too).
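
To spell out what I mean, deduplication at insertion time would look
roughly like this; the hashing and the deep compare on every add are
exactly the costs I am worried about. This is a sketch only: the cache
here is a plain linear table, and a real version would use a proper
hash table.

/* Sketch: deduplicating callback arrays at insertion time. */
#include <stdlib.h>
#include <string.h>

typedef struct
{
   const void *desc;   /* event description */
   void       *func;   /* callback */
} Item;

static struct { unsigned h; Item *arr; size_t size; } cache[256];
static size_t cache_n;

static unsigned
hash_bytes(const void *buf, size_t len)
{
   const unsigned char *p = buf;
   unsigned h = 2166136261u;              /* FNV-1a */
   while (len--) h = (h ^ *p++) * 16777619u;
   return h;
}

/* Return a shared copy of arr, reusing an identical one if ever seen. */
static const Item *
intern_array(const Item *arr, size_t n)
{
   size_t size = n * sizeof(Item);
   unsigned h = hash_bytes(arr, size);    /* cost #1: hash on every add */
   for (size_t i = 0; i < cache_n; i++)   /* cost #2: deep compare */
     if (cache[i].h == h && cache[i].size == size &&
         !memcmp(cache[i].arr, arr, size))
       return cache[i].arr;               /* duplicate: share it */

   Item *shared = malloc(size);           /* first sighting: keep a copy */
   memcpy(shared, arr, size);
   if (cache_n < sizeof(cache) / sizeof(cache[0]))
     {
        cache[cache_n].h = h;
        cache[cache_n].arr = shared;
        cache[cache_n].size = size;
        cache_n++;
     }
   return shared;
}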

>>>>> I'm not a huge fan of callback arrays, but if they do save the memory
>>>>> they claim to be saving, I see no problem with keeping a more limited
>>>>> version of them that lets us optimise everything in the manner described
>>>>> above.
>>>>
>>>> I am not a huge fan of optimization without a clear real-life case.
>>>> Please share numbers and a scenario for when it does matter. I have
>>>> seen enough people wasting their time optimizing things that don't
>>>> matter that I really take such claims with a grain of salt if you are
>>>> not showing a real-life scenario. Sharing a callgrind trace or
>>>> something along those lines would really help make your point here.
>>>
>>> As I said, it's ~1.5% of the efl cpu usage when scrolling around
>>> genlist. It also wastes our memory to have them support priority. And as
>>> your changes proved, there is a reason to minimise callback calls, so we
>>> already have a case; instead of letting everyone reimplement that
>>> counting, it's better to just make callback calls fast. As I said, the
>>> price is very small; all I'm asking for is removing priority from
>>> callback arrays and always assuming they are the lowest priority.
>>
>> You realize that, as an optimization, you are fighting against not
>> calling the function at all: no walking an array, no fetch and compare
>> (not even a dichotomic search). I'm pretty sure the benefit of not
>> triggering the event will remain. Oh, and there are plenty of cases
>> where you will still do the optional propagation, like for animator.
>>
>> As for benchmarking, I did a quick run of 'ELM_TEST_AUTOBOUNCE=300
>> valgrind --tool=callgrind elementary_test  -to genlist'. I see 0.90%
>> of the time spent in efl_event_callback_call (~400 000 calls) and
>> 0.35% in evas_object_event_callback_call (~500 000 calls). It is going
>> to be very, very hard to win anything on that.
>>
>> I also see way bigger fish to fry:
>>  - _efl_object_call_resolve 12.53%
>>  - efl_data_scope_get 7.75%
>>  - efl_isa 3.54%
>>  - _efl_object_call_end 2.26%
>>
>> If you manage to win 10% on any of those, you will have gained more
>> than if you reduced the cost of calling efl_event_callback_call to
>> zero. I am really not convinced that you are focusing on the right
>> problem at all here.
>
> As I said, the statement you are making now is not entirely fair. You
> are essentially saying:
> I found a very slow function that was showing up in our benchmarks. I
> stopped calling it in a few cases, and now it doesn't show up anymore,
> so no need to optimise it.
>
> However, what happens when we use it again? Are we going to have to
> chase it all around and block calls like you did? Isn't it better to
> just make it "good enough for most cases" from the get-go, so the next
> time someone uses it a lot it doesn't show up in our benchmarks?

I bet we are. We are already moving to creating Eo objects for event
info, and eo_add/del is insanely costly compared to callback_call, so
we are going to avoid firing events with complex structures as much as
we can. So yes, I do believe that you are trying to optimize the wrong
thing. Sort the array at creation, check that it is sorted at insertion
time, do a fast walk on arrays, and be done with it. Seriously, anything
more than that is likely to have more side effects and more performance
and memory impact than you think, with absolutely no visible gain.
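
Concretely, the insertion-time check is just a debug scan that release
builds compile out; something along these lines (illustrative types
again, not the real Eo ones):

/* Sketch: debug-only warning when an unsorted callback array is
 * registered. Disabled in production via NDEBUG. */
#include <stdint.h>
#include <stdio.h>

typedef struct
{
   const void *desc;   /* sort key: event-description address */
   void       *func;
} Item;

static void
check_sorted(const Item *arr, size_t n)
{
#ifndef NDEBUG
   for (size_t i = 1; i < n; i++)
     if ((uintptr_t)arr[i - 1].desc > (uintptr_t)arr[i].desc)
       {
          fprintf(stderr, "callback array %p is not sorted!\n",
                  (void *)arr);
          return;
       }
#else
   (void)arr; (void)n;
#endif
}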

> "Bigger fish to fry" - I fried these fish a lot. Maybe there's still
> room for improvement, but if there is, not much. They are just called a
> shitload of times. If I remember correctly, _efl_object_call_end is one
> line: _eo_unref(obj). And that one is essentially if (--(obj->ref) == 0)
> and then just returns in 99.99% of the cases. Not a lot to optimise
> there. :)

Sure. I offer you a present: a 15% speed improvement on resolve, a 20%
speed improvement on scope_get, and a 50% speed improvement on efl_isa.
It took me longer to write this email, and efl_event_callback_call is
still below 1%.

If you have time, I have not worked on optimizing edje_part_recalc; it
is a big contender for improvement ;-)

Have fun.
-- 
Cedric BAIL
