On Tue, Apr 5, 2011 at 7:46 AM, Taekyun Kim <podai...@gmail.com> wrote:
> When we reorder some assembly lines, proper comments should be inserted
> for the people who try to understand the code. It also makes later
> refactoring easier.
I could not come up with any better solution than using different
indentation for logically independent code blocks. Something like
converting:

  /* do A */
  A1
  A2
  /* 1 cycle stall */
  A3
  /* 2 cycles stall */
  A4
  /* done with A */

  /* do B */
  B1
  B2
  B3
  /* done with B */

into:

  /* do A */
  A1
  A2
      /* do B */
      B1
  A3
      B2
      B3
      /* done with B */
  A4
  /* done with A */

Of course, people reading the source code need to know about this
"convention", and it has its own disadvantages too. If anyone can
propose something more maintainable and easier to read, I'm all ears.
Maybe switching to a native code generator that compiles the fast path
code at runtime could make this easier, provided we do a good job of
teaching it the instruction scheduling rules.

> Although I still have some questions about the effectiveness of
> reordering (SW pipelining) on out-of-order cores,

We don't have Cortex-A15 in our hands yet, so we don't know for sure :)
Cortex-A9 has almost the same NEON unit as Cortex-A8, but no longer
supports even limited dual issue for NEON instructions. There are also
other minor differences, mostly related to load/store instructions, but
they very rarely show up.

> it significantly affects the performance on the mostly used in-order
> superscalar CPUs.

Yes, and the results are quite measurable (tens of percent in many
cases).

> And pixman_composite_over_n_8888_0565_ca_process_pixblock_head in the
> tail_head block increases code size, causing i-cache misses.
> We can think of jumping to head and then returning to the next part
> of the tail_head block.
> But it seems difficult to do that without breaking the
> generate_composite_function macro template.

The whole point of using software pipelining here is being able to
overlap the last part of the previous iteration with the beginning of
the current iteration. Jumping to "head" does not make much sense
because the instructions from "head" and "tail" can be reordered quite
wildly, diffusing into each other, so there is no clear border anymore.
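For readers less familiar with the technique, here is a minimal scalar
sketch in C of the loop structure software pipelining produces. The
head()/tail() helpers are hypothetical stand-ins for the head and tail
blocks of generate_composite_function, not actual pixman code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "head": begin processing pixel i (hypothetical load + arithmetic). */
static uint32_t head(const uint32_t *src, size_t i)
{
    return src[i] * 2;
}

/* "tail": finish processing pixel i and store the result. */
static void tail(uint32_t *dst, size_t i, uint32_t t)
{
    dst[i] = t + 1;
}

/* Software-pipelined loop: the tail of iteration i - 1 runs interleaved
 * with the head of iteration i, mirroring the head/tail_head/tail block
 * structure.  Requires n >= 1. */
static void process(uint32_t *dst, const uint32_t *src, size_t n)
{
    uint32_t t = head(src, 0);          /* prologue: head of iter 0 */
    for (size_t i = 1; i < n; i++) {
        uint32_t next = head(src, i);   /* head of iteration i      */
        tail(dst, i - 1, t);            /* tail of iteration i - 1  */
        t = next;
    }
    tail(dst, n - 1, t);                /* epilogue: final tail     */
}
```

On an in-order CPU, interleaving tail and head this way lets independent
instructions from adjacent iterations fill each other's stall cycles; a
C compiler is of course free to reschedule all of this, so the sketch
only conveys the structure, not the actual scheduling.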
And yes, unfortunately pipelining doubles the size of the code, same as
unrolling. It may be possible to add a way to disable pipelining for
some of the fast paths via a new special flag, so that the code size can
be reduced for the fast paths where pipelining provides little or no
performance gain.

> One last thing is about the practical benefits of these fast paths.
> The most important customer of these fast path functions is glyph
> rendering.
> Cairo's image backend show_glyph() functions composite cached glyph
> masks with a solid color pattern using operator OVER in most cases.
> When there are overlaps between glyph boxes (by kerning or something),
> cairo creates a mask equal in size to the entire text extent,
> accumulates component alpha into it with operator ADD, and then
> composites using the entire mask.
> So small OVER composites happen frequently in the non-overlapped case,
> and small ADD composites in the overlapped case.
> We need to focus on the performance of small-sized image composition
> with both operator OVER and ADD.

There have been some ideas about improving text rendering in general and
extending the pixman API to handle this task more effectively.

> The overhead for small-sized images can be approximately identified by
> comparing rendering times of drawing a total of n pixels in m
> operations (n/m pixels per drawing), where m increases starting
> from 1.
> This can capture function call overhead, cache overhead, code control
> overhead, etc. I think it might be interesting for you.

I also have the following experimental branch:

http://cgit.freedesktop.org/~siamashka/pixman/log/?h=playground/slow-path-reporter

It collects statistics about which operations do not have optimized fast
paths, along with the number of uses of these operations, the total
number of pixels processed, the average number of pixels per operation
and the average scanline length. The code is currently Linux specific
and writes its results to syslog.
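To make the glyph-rendering discussion above concrete, here is a minimal
scalar sketch of the per-channel math behind the OVER and ADD operators
on premultiplied 8-bit channels. This is a simplified illustration, not
pixman's actual implementation (the rounding in div255() in particular
differs from pixman's):

```c
#include <assert.h>
#include <stdint.h>

/* Divide by 255 with rounding (simplified normalization helper). */
static uint8_t div255(uint16_t v)
{
    return (uint8_t)((v + 127) / 255);
}

/* Porter-Duff OVER on one premultiplied channel:
 *   dst' = src + dst * (1 - src_alpha)
 * Assumes src <= src_alpha (valid premultiplied data). */
static uint8_t op_over(uint8_t src, uint8_t src_alpha, uint8_t dst)
{
    return (uint8_t)(src + div255((uint16_t)(255 - src_alpha) * dst));
}

/* ADD: saturating per-channel addition, as used when accumulating
 * component alpha for overlapping glyph boxes. */
static uint8_t op_add(uint8_t src, uint8_t dst)
{
    uint16_t sum = (uint16_t)src + dst;
    return sum > 255 ? 255 : (uint8_t)sum;
}
```

The NEON fast paths discussed in this thread compute exactly this kind
of short per-pixel arithmetic, which is why per-call overhead dominates
for the small composites produced by glyph rendering.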
These results can be converted into a more human-readable form by a
script. I'm using it quite successfully, and it has revealed some
missing optimizations which would have been hard to identify any other
way.

--
Best regards,
Siarhei Siamashka

_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman