Re: Faster calls (again)

Arne Goedeke Wed, 08 Mar 2017 01:33:39 -0800

I think I managed to fix the last issue. I was somehow confusing things
and removed the locals from the stack before unlinking the stack frame.
This of course broke trampolines. I also ended up rebasing the branch to
get rid of the reverts I did at some point.


The current state passes the testsuite (the same tests as 8.1 at least).
Performance-wise it is roughly where 8.1 is, except for map/automap
being significantly faster. There are some slowdowns currently, which
are due to me removing some fast paths from the F_CALL_OTHER opcode. I
will look into that.

I re-added most of the tracing code; however, some of it is unfinished
and DTrace is probably broken. I have also not looked at PROFILING yet;
that is probably not right yet either.

Sidenote: Profiling unfortunately does not work properly when fork()ing
because timers change. It might even crash when running with debug mode
because of that. But that is probably just a bug we need to fix.

What's currently left on my list before proposing to merge it into 8.1/8.3:

* Make sure the map/automap optimizations do not break in pathological
  cases (e.g. objects being destructed or similar).
* Maybe think about the API again (e.g. callsite_execute and
  callsite_return could be merged; the same goes for
  callsite_init/callsite_set_args).

Otherwise, I played around with adding frame caching to apply_array,
which looks promising performance-wise. However, it takes some attention
to make sure the stack traces are always correct. This would be a good
test-case for caching frames in general.

Anyway, feedback welcome, as usual,

Arne


On 02/22/17 09:37, Arne Goedeke wrote:
> I am not quite sure, since I have not had the time to look into it yet.
> My feeling is that callsite_reset() is currently broken, probably when
> trampolines are used. It's probably easy to fix. I was also planning to
> write a couple of tests which try to cover all possible paths of the
> function call API. Having to run the full testsuite can be quite annoying.
> 
> I also started adding some benchmarks for function calls to the
> pike-benchmark repo. That might make it easier to tweak specific
> optimizations.
> 
> Arne
> 
> On 02/21/17 22:12, Martin Karlgren wrote:
>> Hi Arne,
>>
>> Alright. Any idea what the crash might be related to?
>>
>> I’ve pushed the marty/call_frames branch now. As mentioned, something breaks 
>> when precompiled bytecode is decoded, so many testsuite tests will segfault 
>> (since they are precompiled).
>>
>> Compiling --with-mc-stack-frames and running the very nice 
>> Debug.generate_perf_map() (previously implemented by TobiJ) should enable 
>> perf to extract what’s needed. I’ve used 
>> https://github.com/jrfonseca/gprof2dot and 
>> http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html for 
>> visualisation.
>>
>> /Marty
>>
>>> On 21 Feb 2017, at 20:31 , Arne Goedeke <[email protected]> wrote:
>>>
>>> Hi Marty,
>>>
>>> thanks!
>>>
>>> Yes, low_mega_apply still needs to be refactored. It is slightly more
>>> "complicated" because of APPLY_STACK, where the return value will
>>> overwrite the function on the stack. I want to fix the last crash in the
>>> testsuite before refactoring that. If you are interested in working on
>>> those, just let me know so we don't both do it ;)
>>>
>>> Adding more perf support would be great, do you have your code in a
>>> branch somewhere? I would be interested to have a look at it.
>>>
>>> Arne
>>>
>>> On 02/20/17 23:47, Martin Karlgren wrote:
>>>> Hi Arne,
>>>>
>>>> That’s awesome!
>>>>
>>>> I’d love to help (with the limited spare time I have.) I guess 
>>>> low_mega_apply should be refactored to make use of the new API too?
>>>>
>>>> Speaking of faster calls, I’ve incidentally been poking around a bit with 
>>>> machine code function calling conventions lately. For profiling purposes 
>>>> (i.e. Linux perf) I’ve added minimal call frame information to Pike 
>>>> functions in the amd64 machine code generator. I’ve gotten to the point 
>>>> where I can start Roxen and get proper stack traces from perf, but the 
>>>> testsuite still fails – it seems related to decoding of dumped bytecode, 
>>>> and I haven’t been able to sort out why.
>>>> Anyways, the good thing is that readymade visualisation tools built on 
>>>> perf output can be used to profile Pike code, and the interaction between 
>>>> Pike code and C functions is more apparent.
>>>> Examples from a very simple Roxen site being hit by apachebench:
>>>> http://marty.se/dotgraph.png (nodes with a 
>>>> “perf-17628.map” header represent Pike functions)
>>>> http://marty.se/flamegraph.svg (time on 
>>>> horizontal axis, stack depth on vertical axis).
>>>>
>>>> Hopefully this can be used to weed out where we should start looking for 
>>>> optimisation candidates eventually.
>>>>
>>>> /Marty
>>>>
>>>
>>
>>
>
