I think I managed to fix the last issue. I had somehow confused things and was removing the locals from the stack before unlinking the stack frame, which of course broke trampolines. I also ended up rebasing the branch to get rid of the reverts I had done at some point.
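To illustrate what went wrong, here is a small self-contained model of the ordering issue. It is not the actual interpreter code -- the struct, the helpers and the names are all made up -- but it shows why the frame has to be unlinked (so a trampoline that still references it gets its locals preserved) before the locals leave the stack:

  /* Toy model of the ordering bug, not actual Pike code.
   * All names here (frame, unlink_frame, ...) are made up. */
  #include <stdio.h>
  #include <string.h>
  #include <stdlib.h>

  struct frame {
      int refs;          /* >1 means a trampoline still references us  */
      int num_locals;
      int *locals;       /* points into the value stack while live     */
      int *saved;        /* heap copy made when the frame is unlinked  */
  };

  static int stack[16];
  static int sp = 0;

  /* Unlink a frame: if someone (a trampoline) still holds a reference,
   * copy its locals off the stack before the slots get reused. */
  static void unlink_frame(struct frame *f)
  {
      if (--f->refs > 0) {
          f->saved = malloc(f->num_locals * sizeof(int));
          memcpy(f->saved, f->locals, f->num_locals * sizeof(int));
          f->locals = f->saved;
      }
  }

  int main(void)
  {
      struct frame f = { .refs = 2 /* caller + trampoline */,
                         .num_locals = 2, .locals = &stack[0] };
      stack[sp++] = 7;
      stack[sp++] = 9;

      /* Correct order: unlink first, then pop the locals. */
      unlink_frame(&f);
      sp -= f.num_locals;
      memset(stack, 0, sizeof(stack));   /* stack slots get reused ... */

      /* ... but the trampoline still sees the preserved locals. */
      printf("%d %d\n", f.locals[0], f.locals[1]);   /* prints: 7 9 */
      free(f.saved);
      return 0;
  }

Doing the "sp -= num_locals" before the unlink is exactly the broken order: the frame's locals pointer would then refer to slots that are already gone.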
The current state passes the testsuite (the same tests as 8.1, at least).
Performance-wise it is roughly where 8.1 is, except for map/automap being
significantly faster. There are some slowdowns currently, which are due to
me removing some fast paths from the F_CALL_OTHER opcode. I will look into
that.

I re-added most of the tracing code; however, some of it is unfinished and
DTrace is probably broken. I have also not looked at PROFILING yet, so that
is probably not right either. Side note: profiling unfortunately does not
work properly when fork()ing, because the timers change. It might even
crash when running in debug mode because of that. But that is probably just
a bug we need to fix.

What's currently left on my list before proposing to merge it into 8.1/8.3:

* Make sure the map/automap optimizations do not break in pathological
  cases (e.g. objects being destructed or similar).
* Maybe think about the API again (e.g. callsite_execute and
  callsite_return could be merged, same with
  callsite_init/callsite_set_args). A toy sketch of what that could look
  like is at the end of this mail.

Otherwise, I played around with adding frame caching to apply_array, which
looks promising performance-wise. However, it takes some attention to make
sure the stack traces are always correct. This would be a good test case
for caching frames in general.

Anyway, feedback welcome, as usual,

Arne

On 02/22/17 09:37, Arne Goedeke wrote:
> I am not quite sure, since I did not have the time to look into it, yet.
> My feeling is that callsite_reset() is currently broken, probably when
> trampolines are used. It's probably easy to fix. I was also planning to
> write a couple of tests which try to cover all possible paths of the
> function call API. Having to run the full testsuite can be quite annoying...
>
> I also started adding some benchmarks for function calls to the
> pike-benchmark repo. That might make it easier to tweak specific
> optimizations.
>
> Arne
>
> On 02/21/17 22:12, Martin Karlgren wrote:
>> Hi Arne,
>>
>> Alright. Any idea what the crash might be related to?
>>
>> I’ve pushed the marty/call_frames branch now. As mentioned, something breaks
>> when precompiled bytecode is decoded, so many testsuite tests will segfault
>> (since they are precompiled).
>>
>> Compiling --with-mc-stack-frames and running the very nice
>> Debug.generate_perf_map() (previously implemented by TobiJ) should enable
>> perf to extract what’s needed. I’ve used
>> https://github.com/jrfonseca/gprof2dot and
>> http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html for
>> visualisation.
>>
>> /Marty
>>
>>> On 21 Feb 2017, at 20:31, Arne Goedeke <[email protected]> wrote:
>>>
>>> Hi Marty,
>>>
>>> thanks!
>>>
>>> Yes, low_mega_apply still needs to be refactored. It is slightly more
>>> "complicated" because of APPLY_STACK, where the return value will
>>> overwrite the function on the stack. I want to fix the last crash in the
>>> testsuite before refactoring that. If you are interested in working on
>>> those, just let me know so we don't both do it ;)
>>>
>>> Adding more perf support would be great, do you have your code in a
>>> branch somewhere? I would be interested to have a look at it.
>>>
>>> Arne
>>>
>>> On 02/20/17 23:47, Martin Karlgren wrote:
>>>> Hi Arne,
>>>>
>>>> That’s awesome!
>>>>
>>>> I’d love to help (with the limited spare time I have). I guess
>>>> low_mega_apply should be refactored to make use of the new API too?
>>>>
>>>> Speaking of faster calls, I’ve incidentally been poking around a bit with
>>>> machine code function calling conventions lately. For profiling purposes
>>>> (i.e. Linux perf) I’ve added minimal call frame information to Pike
>>>> functions in the amd64 machine code generator. I’ve gotten to the point
>>>> where I can start Roxen and get proper stack traces from perf, but the
>>>> testsuite still fails – it seems related to decoding of dumped bytecode,
>>>> and I haven’t been able to sort out why.
>>>> Anyway, the good thing is that ready-made visualisation tools built on
>>>> perf output can be used to profile Pike code, and the interaction between
>>>> Pike code and C functions is more apparent.
>>>> Examples from a very simple Roxen site being hit by apachebench:
>>>> http://marty.se/dotgraph.png (nodes with a
>>>> “perf-17628.map” header represent Pike functions)
>>>> http://marty.se/flamegraph.svg (time on the
>>>> horizontal axis, stack depth on the vertical axis).
>>>>
>>>> Hopefully this can be used to weed out where we should start looking for
>>>> optimisation candidates eventually.
>>>>
>>>> /Marty
>>>>
>>>
>>
>>
>
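PS: To make the API point from my list above a bit more concrete, here is a
toy mock of the call sequence. None of this is the real code -- only the
function names callsite_init/callsite_set_args/callsite_execute/
callsite_return come from the branch, the struct and the signatures below
are made up -- but it shows the shape of the merge I have in mind:

  /* Toy mock only: real signatures and semantics differ. */
  #include <stdio.h>

  struct callsite { int num_args; const int *args; int retval; };

  static void callsite_init(struct callsite *c, int num_args)
  { c->num_args = num_args; c->args = NULL; c->retval = 0; }

  static void callsite_set_args(struct callsite *c, const int *args)
  { c->args = args; }

  static void callsite_execute(struct callsite *c)
  { for (int i = 0; i < c->num_args; i++) c->retval += c->args[i]; }

  static int callsite_return(struct callsite *c)
  { return c->retval; }

  /* Possible merged variants, as suggested in the list above: */
  static void callsite_init_args(struct callsite *c, int n, const int *a)
  { callsite_init(c, n); callsite_set_args(c, a); }

  static int callsite_execute_return(struct callsite *c)
  { callsite_execute(c); return callsite_return(c); }

  int main(void)
  {
      const int args[] = { 1, 2, 3 };
      struct callsite c;

      /* current four-step sequence ... */
      callsite_init(&c, 3);
      callsite_set_args(&c, args);
      callsite_execute(&c);
      printf("%d\n", callsite_return(&c));          /* 6 */

      /* ... versus the merged two-step one */
      callsite_init_args(&c, 3, args);
      printf("%d\n", callsite_execute_return(&c));  /* 6 */
      return 0;
  }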
