> Do you have a more specific example of a code fragment that needs
> conversion?

In the original pixman-arm-neon-asm.S:

    .macro pixman_composite_over_8888_8_0565_process_pixblock_head
        ...
        vsli.u16    q2, q2, #5
        ...
        vraddhn.u16 d2, q6, q10
        ...
        vshrn.u16   d30, q2, #2

If all registers are simply converted to Vn, it becomes:

    .macro pixman_composite_over_8888_8_0565_process_pixblock_head
        ...
        sli         v2.8h, v2.8h, #5
        ...
        raddhn      v2.8b, v6.8h, v10.8h
        ...
        shrn        v30.8b, v2.8h, #2

The raddhn in the middle corrupts v2, so the following
shrn v30.8b, v2.8h, #2 would no longer be correct.
I have run into many other conflicts like this.

I couldn't find anything in ARM's documentation specifying that Dn can
be the lower part of V(n/2).

On 7 April 2016 at 16:31, Siarhei Siamashka <siarhei.siamas...@gmail.com> wrote:
> On Tue, 5 Apr 2016 20:20:54 +0900
> Mizuki Asakura <ed6e1...@gmail.com> wrote:
>
>> > This code is not just there for prefetching. It is an example of
>> > using software pipelining:
>>
>> OK. I understand.
>> But the code is very hard to maintain... I've run into too many
>> register conflicts.
>
> The *_tail_head variant has exactly the same code as the individual
> *_head and *_tail macros, but the instructions are just reordered.
> There should be no additional register clashes if you are doing the
> exact 1-to-1 conversion.
>
> If you are considering modifying the algorithm, then for now it's
> better not to touch these problematic parts of the code and keep them
> the way they were in your first patch.
>
>> # q2 and d2 were used in the same sequence. That cannot exist in aarch64-neon.
>
> Why not? The register mapping is something like this:
>     q2 -> v2.16b
>     d2 -> v1.8b (because d2 is the lower 64-bit half of the 128-bit q1
>                  register)
>
> Do you have a more specific example of a code fragment that needs
> conversion?
>
>> Anyway, I'll try to remove unnecessary register copies as you've suggested.
>> After that, I'll also try to run benchmarks comparing
>>   * advance vs none
>>   * L1 / L2 / L3 (which Cortex-A53 doesn't have), keep / strm
>> to find the better configuration.
>
> OK, thanks.
> Just don't overwork yourself. What I suggested was only
> fixing a few very obvious and trivial things in the next revision of
> your patch. Then I could have a look at what is remaining and maybe
> have some ideas about how to fix it (or maybe not).
>
> The benchmark against the 32-bit code is useful for prioritizing
> this work (pay more attention to the things that have slowed down
> the most).
>
> But I think that your patch is already almost good enough. And it is
> definitely very useful for the users of AArch64 hardware, so we
> probably want to have it applied and released as soon as possible.
>
>> But it is only a result from Cortex-A53 (which you and I have). Can
>> anyone test other (expensive :) aarch64 environments?
>> (Cortex-Axx, Apple Ax, NVidia Denver, etc, etc...)
>
> I have ssh access to Cavium ThunderX and APM X-Gene. Both of these
> are "server" ARM processors and taking care of graphics/multimedia is
> not their primary task.
>
> The ThunderX is a 48-core processor with very small and simple
> individual cores, optimized for reducing the number of transistors.
> It does not even implement the 32-bit mode at all. So we can't
> compare the performance of the 32-bit and the 64-bit pixman code
> on it. ThunderX has a 64-bit NEON data path. Moreover, it has a
> particularly bad microcoded implementation of some NEON instructions,
> for example the TBL instruction needs 320 (!) cycles:
>     https://gcc.gnu.org/ml/gcc-patches/2015-06/msg01676.html
>
> The X-Gene is a reasonably fast out-of-order processor with a wide
> instruction decoder, which can run normal code reasonably fast.
> However it also only has a 64-bit NEON data path.
>
> A low power, but more multimedia oriented Cortex-A53 with a full
> 128-bit NEON data path is faster than either of these when running
> NEON code.
>
> I can try to run the lowlevel-blt-bench on X-Gene and provide the
> 32-bit and 64-bit logs.
> However I'm not the only user of that
> machine and running the benchmark undisturbed may be problematic.
>
> Either way, we are very likely just going to see that reducing the
> number of redundant instructions has a positive impact on performance,
> in much the same way as on Cortex-A53.
>
> --
> Best regards,
> Siarhei Siamashka
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/pixman
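
As a footnote to the register question in this thread, here is a minimal sketch of the q/d aliasing that the quoted mapping (q2 -> v2.16b, d2 -> v1.8b) relies on. It assumes a strict 1-to-1 conversion in which AArch32 qN becomes AArch64 vN, so d(2k) and d(2k+1) land in the low and high 64-bit halves of vk. The helper names `aarch64_alias` and `overlaps` are hypothetical and are not part of pixman or of any assembler.

```python
import re


def aarch64_alias(reg):
    """Return (v_index, part) for an AArch32 NEON register name.

    Assumption: 1-to-1 conversion where qN maps to vN. Under that rule
    d(2k) is the low 64 bits of vk and d(2k+1) is the high 64 bits.
    """
    match = re.fullmatch(r"([qd])(\d+)", reg.strip().lower())
    if match is None:
        raise ValueError(f"not an AArch32 NEON q/d register: {reg!r}")
    kind, index = match.group(1), int(match.group(2))
    if kind == "q":
        if index > 15:
            raise ValueError(f"AArch32 has q0..q15 only: {reg!r}")
        return (index, "full")
    if index > 31:
        raise ValueError(f"AArch32 has d0..d31 only: {reg!r}")
    return (index // 2, "low" if index % 2 == 0 else "high")


def overlaps(a, b):
    """True if the two AArch32 registers share any bits."""
    va, part_a = aarch64_alias(a)
    vb, part_b = aarch64_alias(b)
    if va != vb:
        return False
    return part_a == "full" or part_b == "full" or part_a == part_b


if __name__ == "__main__":
    # Registers from the fragment discussed in this thread.
    for name in ("q2", "d2", "d30"):
        print(name, "->", aarch64_alias(name))
    # d2 lives in v1, so writing it does not clobber q2 (v2).
    print("d2 overlaps q2:", overlaps("d2", "q2"))
    # d4 is the low half of q2, so those two do clash.
    print("d4 overlaps q2:", overlaps("d4", "q2"))
```

Under this rule the destination of `vraddhn.u16 d2, q6, q10` would be written as `v1.8b` rather than `v2.8b`, which leaves v2 intact for the following shrn. Odd-numbered d registers need extra care: the high half of a V register cannot be named as a D register in AArch64, so it has to be reached through an element such as `v1.d[1]` or through the "2" instruction forms (for example `raddhn2`).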