Hi Siarhei,
It is always possible to find time for fixing bugs and reviewing code, since that is obviously useful work.
I think we have been unlucky because it seems to me that our schedules haven't aligned very well. I have put time aside on a few occasions over the last couple of years to make a concentrated effort to get my ARM patches accepted, but there hasn't been anyone around with the ability to review them at those times. I can see you're working on it at present, but for personal reasons I don't have much time right now for detailed discussion, benchmarking and reworking of my patches! I still read the list though, so I was naturally very interested to see AArch64 being discussed. I hate to see effort going to waste though, whether it's mine or anyone else's, and it looked like Mizuki was unaware of my work, so I thought I should mention it.
I did give this branch a try, and I'm not sure whether you noticed, but your "lowlevel-blt-bench -b" runs are actually benchmarking a special code path for horizontal-only scaling against code that scales in both directions. That's an obvious flaw in the benchmark itself, and it gives misleading results!
In the past I had criticism for posting patch series that were too long to be easily reviewed, so I have been dribbling them in piece by piece.

I see you also identified the limitations of lowlevel-blt-bench when it comes to scaled plots - that's why I created affine-bench. That program *has* been accepted into git, although my separable bilinear fetchers haven't yet, even though they were the reason why I wrote it! Perhaps you may not have noticed it? It specifically allows different scaling factors to be tested. For example, for my ARMv6 separable bilinear filter code, I recorded benchmarks in my commit log as follows:

Improvement:        x increment
                    0.5       0.75      1.0       1.5       2.0
y increment  0.5    +196.7%   +183.6%   +181.8%   +206.6%   +198.4%
             0.75   +182.2%   +166.2%   +164.0%   +194.8%   +185.8%
             1.0    +271.7%   +234.4%             +282.7%   +257.9%
             1.5    +154.6%   +135.3%   +134.3%   +173.3%   +164.8%
             2.0    +144.1%   +124.2%   +123.3%   +165.6%   +155.5%

It looks like I didn't record them in the ARMv7 separable bilinear filter commit, but some of my private notes indicate that before I wrote the ARMv7 separable bilinear filter, I was measuring my ARMv6 separable bilinear filter as around 20% faster on Cortex-A7 than the old ARMv7 bilinear fast path for small y increments, even though it didn't have the advantage of being able to use NEON instructions. That's how I decided it was worth the effort of writing the ARMv7 separable filter in the first place. The reason I didn't include the numbers is that profiling takes a lot of time, and since I was intending to dribble the commits to the mailing list in small batches, I decided to defer the detailed profiling until I was ready to post those patches, in case any other patches accepted in the meantime affected the numbers.

It is interesting to notice that the ARMv6 improvements in my table above agree with your assessment that y factors of 1 are not particularly representative. The table also shows that the separable filters give the most improvement when the y increment is small.
This makes sense because the buffer containing first-pass scaled data gets more re-use when the y factor is small. I don't think it's unreasonable to pay close attention to the performance for small y increments, because that's exactly the case where you would expect to use bilinear filtering if you care about image quality. For increments above 1, image quality is best served by a multi-tap filter instead.
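To make the buffer re-use argument concrete, here is a minimal C sketch of a two-pass separable bilinear scaler for a single 8-bit channel, using 16.16 fixed-point increments as pixman does. The function names and the two-line cache are invented for this sketch, not pixman's actual code; returning the number of pass-1 invocations just makes the re-use effect measurable.

```c
/* Illustrative two-pass (separable) bilinear scaler for one 8-bit
 * channel with 16.16 fixed-point increments. The names and the
 * two-line cache are invented for this sketch; this is not
 * pixman's actual implementation. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FIXED_SHIFT 16
#define FIXED_ONE   (1 << FIXED_SHIFT)

/* Pass 1: horizontal bilinear scaling of one source row into a line buffer. */
static void scale_line_x(const uint8_t *src, int src_w,
                         uint8_t *dst, int dst_w, uint32_t x_inc)
{
    uint32_t x = 0;
    for (int i = 0; i < dst_w; i++, x += x_inc) {
        int      xi    = (int)(x >> FIXED_SHIFT);
        uint32_t frac  = x & (FIXED_ONE - 1);
        uint32_t left  = src[xi];
        uint32_t right = src[xi + 1 < src_w ? xi + 1 : xi];
        dst[i] = (uint8_t)((left * (FIXED_ONE - frac) + right * frac)
                           >> FIXED_SHIFT);
    }
}

/* Two-pass scaler caching the two most recently scaled source lines.
 * Returns the number of pass-1 invocations. Assumes dst_w <= 1024. */
static int scale_bilinear(const uint8_t *src, int src_w, int src_h,
                          uint8_t *dst, int dst_w, int dst_h,
                          uint32_t x_inc, uint32_t y_inc)
{
    uint8_t  line0[1024], line1[1024];
    int      cached0 = -1, cached1 = -1;
    int      pass1_calls = 0;
    uint32_t y = 0;

    for (int j = 0; j < dst_h; j++, y += y_inc) {
        int      yi   = (int)(y >> FIXED_SHIFT);
        int      yi1  = yi + 1 < src_h ? yi + 1 : yi;
        uint32_t frac = y & (FIXED_ONE - 1);

        if (cached0 != yi) {
            if (cached1 == yi) {
                /* Slide the cache window down one line. */
                memcpy(line0, line1, (size_t)dst_w);
            } else {
                scale_line_x(src + yi * src_w, src_w, line0, dst_w, x_inc);
                cached1 = -1;
                pass1_calls++;
            }
            cached0 = yi;
        }
        if (cached1 != yi1) {
            scale_line_x(src + yi1 * src_w, src_w, line1, dst_w, x_inc);
            cached1 = yi1;
            pass1_calls++;
        }

        /* Pass 2: vertical blend of the two cached lines. */
        for (int i = 0; i < dst_w; i++)
            dst[j * dst_w + i] =
                (uint8_t)((line0[i] * (FIXED_ONE - frac) +
                           line1[i] * frac) >> FIXED_SHIFT);
    }
    return pass1_calls;
}
```

With an 8x8 source, a y increment of 0.5 ends up doing half a horizontal pass per output row, while a y increment of 2.0 does two per row: that per-output-row amortisation is where the separable approach wins at small y increments.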
Your implementation provides better performance on ARM Cortex-A7 for the special case of horizontal-only scaling. But for general purpose scaling (for example, the same scale factor on the x and y axes), performance is more or less comparable between the existing bilinear fast paths and your new implementation. And if we take a different processor, for example ARM Cortex-A9, everything changes: general purpose scaling is much slower with your code, and even the horizontal-only special case is somewhat slower than the old fast paths running general purpose bilinear scaling.
OK, that's a bit disappointing, though it should be offset to some extent by the reduction in calculations required when using scaling factors less than 1, and of course by the fact that the fetchers are applicable to a much wider range of operations than the fast paths are. It's possible I've chosen some code sequences that are particularly painful for the Cortex-A9. Since ARM stopped publishing details of cycle counts, interlocks and so on, it has become quite difficult to hand-schedule code. IIRC at least the NEON pipeline in the A9 is in-order, like those in the A8, A5, A7 and A53, so the timings can be figured out by experiment, but that's very time-consuming and requires access to lots of hardware. There is also the question of whether we would want Pixman to carry separate routines optimized for each core, or just try to choose one routine that is best all-round. But then who decides which cores are most important to test with?
However, if we try to do a really fair comparison, then we need to get rid of the redundant memcpy for the bilinear src_8888_8888 operation in the iterator-based implementation. I had some old patches for this: https://lists.freedesktop.org/archives/pixman/2013-September/002879.html
Interesting. That seems a reasonable approach, and levels the playing field a bit between iterators and fast paths. My pass-2 (vertical interpolation) code currently assumes that the cacheline alignment is the same between its input and output buffers, but I can imagine that losing the memcpy that follows it for SRC operations could make it a net win. And in theory I could add versions of the pass-2 code to support differing alignments to improve the situation further.
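The shape of that memcpy elimination can be sketched in a few lines of C. This is a hypothetical mock-up, not pixman's actual pixman_iter_t machinery; the struct and function names are invented. The point is only that, for SRC operations, the caller can aim the iterator's output buffer at the destination row itself, so the final pass writes in place and the copy disappears:

```c
/* Hypothetical sketch of skipping the redundant memcpy for SRC
 * operations. The iterator struct and callback names are
 * illustrative, not pixman's real API. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct scanline_iter {
    uint32_t *buffer;   /* where get_scanline writes its output */
    void (*get_scanline)(struct scanline_iter *it, int y, int width);
} scanline_iter_t;

/* Example fetcher: writes a y-dependent pattern into it->buffer. */
static void fetch_solid(scanline_iter_t *it, int y, int width)
{
    for (int i = 0; i < width; i++)
        it->buffer[i] = 0xff000000u | (uint32_t)y;
}

/* SRC operator, temp-buffer version: fetch, then copy out.
 * Assumes width <= 256. */
static void composite_src_copy(scanline_iter_t *it, uint32_t *dst,
                               int width, int height)
{
    uint32_t temp[256];
    it->buffer = temp;
    for (int y = 0; y < height; y++) {
        it->get_scanline(it, y, width);
        memcpy(dst + y * width, temp, width * sizeof(uint32_t)); /* redundant */
    }
}

/* SRC operator, in-place version: point the iterator at the
 * destination row so the output lands there directly. */
static void composite_src_direct(scanline_iter_t *it, uint32_t *dst,
                                 int width, int height)
{
    for (int y = 0; y < height; y++) {
        it->buffer = dst + y * width;   /* no temp buffer, no memcpy */
        it->get_scanline(it, y, width);
    }
}
```

As noted above, the real-world caveat is alignment: the destination row won't generally share the temp buffer's cacheline alignment, which is where pass-2 variants for differing alignments would come in.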
I'll try to tune the single-pass bilinear scaling code for Cortex-A53 to see if it can become more competitive. There are some obvious performance problems in it. For example, Cortex-A53 wants ARM and NEON instructions to be interleaved in perfect pairs, so we can't have separate chunks of ARM and NEON code anymore.
Also interesting. I wasn't aware of that about the A53, although I admit I haven't gone looking for information about its microarchitecture yet. Are you aware of anywhere that information about the quirks of this and other cores is documented?
But you don't need to worry about this right now. Let's focus on one task at a time. It's probably best to use the current pixman code for the initial AArch64 conversion round, and at the same time we can try to look at integrating your patches as 32-bit code first.
Thanks for all your detailed analysis, Siarhei. It's a very pleasant change after months of my patches being ignored or getting bogged down in discussions of minutiae. If you've actually got time to look at some of my code at present, I could repost a few patches to get things started, perhaps focusing on the NEON ones as they're the ones pertinent to the AArch64 conversion. I can't promise major reworks to anything in the short term, but maybe at least some of them will be straightforward to get accepted. I'd love to be able to merge my branch fully eventually; at the moment Raspberry Pi is using my branch, but it'll cause a big headache if/when they move to AArch64.

Ben
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/pixman