> But it is much more interesting to compare the performance of this > patch with the existing 32-bit ARM NEON code. You can build the > 32-bit tests programs (including lowlevel-blt-bench) using a 32-bit > ARM crosscompiler either on your desktop PC or on the ARM board: > > ./configure --host=arm-linux-gnueabihf --enable-static-testprogs \ > --disable-libpng --disable-gtk > make
Ok. I succeed to install cross-build toolchain in my Linux PC and create armeabihf version of pixman. And attached the benchmark result to the bug ticket. Please find: https://bugs.freedesktop.org/attachment.cgi?id=122695 At least, I have to catch up with this result on aarch64... On 4 April 2016 at 19:28, Mizuki Asakura <ed6e1...@gmail.com> wrote: >> OK, thanks for this information. What I mean is that this should be >> preferably a separate patch. > > I've sent it on the separated patch. > > On 3 April 2016 at 20:22, Siarhei Siamashka <siarhei.siamas...@gmail.com> > wrote: >> On Sat, 2 Apr 2016 20:28:44 +0900 >> Mizuki Asakura <ed6e1...@gmail.com> wrote: >> >>> Thanks for your reply. >>> >>> > The main concern is whether we can get support for all these >>> > assembly syntax flavours, calling conventions and minor instructions >>> > differences, while keeping the code maintainable. >>> >>> In the aarch64 point of view, I believe we should only support ARM's >>> original >>> syntax and calling convensions. It would be most portable. >> >> The ILP32 is an alternative ABI, which is being endorsed or even >> developed by ARM: >> http://infocenter.arm.com/help/topic/com.arm.doc.dai0490a/index.html >> >> When/if somebody tries to use pixman in an AArch64 ILP32 system, this >> may cause some difficulties/inconveniences for us. >> >>> In iOS case, they may be able to create some converter script that only >>> translate instruction name to their own since the architcture is same >>> (as opposed aarch32-neon vs aarch64-neon is (almost) same, but different). >> >> Yes. The converters exists, but AFAIK the pixman assembly code is >> way too difficult for them to parse. Again, the problem is technically >> solvable, just nobody has invested efforts into solving it yet. Either >> by contributing the needed code or by sponsoring this work. >> >> I mentioned this only because the unified 32-bit/64-bit ARM assembly >> preprocessor could also potentially take care of this iOS issue as a >> side effect. >> >>> # How about SSE2 on OS X ? Is it same syntax and calling convension ? >> >> I'm not familiar with OS X myself. But the SSE2 code in pixman is >> currently using intrinsics, so this makes everything quite simple >> as far as portability is concerned. >> >> If we were to use hand tuned assembly optimizations for x86, then we >> would have to take care of the 32-bit Linux/Windows, 64-bit Linux with >> x32 ABI, 64-bit Linux and 64-bit Windows cases separately. Windows >> arrived late to the x86-64 scene, but they still decided to invent >> their own incompatible calling conventions. All of this is fine for >> intrinsics, but may clearly cause problems for assembly implementations. >> >> Hand written assembly has a relatively high maintenance cost. But it >> pays off if we want to have good performance. >> >> I still see a large potential for improving pixman performance on >> x86 (replace intrinsics with hand written assembly and make a better >> use of SSSE3, AVX, etc.). But somebody has to shoulder the expenses >> for this to happen. BTW, I would be happy to take this task if >> somebody is willing to pay :-) >> >>> > /* stride is always multiple of 32bit units in pixman */ >>> > - uint32_t byte_stride = stride * sizeof(uint32_t); >>> > + int32_t byte_stride = stride * sizeof(uint32_t); >>> >>> It caused horribly crashes in aarch64. Why it works on aarch32 ? :) >>> (32-bit integer overflow may help them...) >> >> OK, thanks for this information. What I mean is that this should be >> preferably a separate patch. >> >>> > I'm currently investigating the possibility of maybe "inventing" a >>> > common unified syntax for some preprocessor script to take care of >>> > both 32-bit and 64-bit ARM code. But we still can't be sure if this >>> > approach is practical (or even doable) and whether it can solve more >>> > problems than introduce new ones... >>> >>> At first, I've also tried to create a "converter script" from >>> pixman-arm-neon-asm*.S to >>> aarch64 neon codes just translating instruction, register names and so on. >> >> I see. It is nice to know that we basically came up with the same >> solution at least as the starting point :-) >> >>> But I've found that this script won't work on pixman's code because of: >>> >>> * conflicting register names: both Qn and Dn are used in same place >>> * highly optimized for aarch32-neon register structure. assumes d29, >>> d30 is low / high of q15 >> >> It's all mostly just a syntax sugar. The Dn registers still do exist >> in a somewhat hidden way. For example: >> >> vshrn.u32 d0, q15, #1 >> vshrn.u32 d1, q15, #1 >> >> directly translates into >> >> shrn v0.4h, v15.4s, #1 >> shrn2 v0.8h, v15.4s, #1 >> >> The selection of SHRN or SHRN2 instruction variant decides whether >> the lower or upper half of the 128-bit destination register is updated. >> >> The 64-bit source operands (Dn registers) are handled in the very >> same way: >> >> vmull.u16 q1, d0, d0 >> vmull.u16 q2, d1, d1 >> >> directly translates into >> >> umull v1.4s, v0.4h, v0.4h >> umull2 v2.4s, v0.8h, v0.8h >> >> There are of course some limitations. Mixing odd and even >> numbered d-registers as the source operands is not supported >> in AArch64. This is a bit inconvenient, but the AArch64 code >> still needs to take this into account either way regardless >> of the syntax. >> >> I'm considering to add something like the following 'umull12' >> fake instruction: >> >> umull12 v1.4s, v0.4h, v0.8h >> >> which directly translates into >> >> vmull.u16 q1, d0, d1 >> >> So that we can write 32-bit assembly code in the AArch64 style. >> And use then use conditional compilation (#ifdef __aarch64__) >> for a few parts of code where we really need to have separate >> code paths for 32-bit and 64-bit mode. >> >> And such fake instructions are easy to grep in the source code >> if we do automatic 32-bit -> 64-bit conversion. >> >> >> The most challenging differences are the VZIP/VUZP instructions >> and also structure loads. >> >> Frankly speaking, I like the VZIP/VUZP behaviour change in AArch64. >> The 32-bit way of handling them was a 'destructive' update of the >> registers in-place. And for software pipelining we want to have >> a 'non-destructive' form to move the intermediate results between >> registers instead of updating them in-place. Fortunately pixman >> does not use these instructions too much. They are just used for >> conversion between 'planar' and 'packed' form via a sequence of >> 4 instructions as explained at: >> >> https://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.S?id=pixman-0.34.0#n128 >> >> This sequence of 4 instructions needs an AArch64 equivalent. Now we >> have macros 'pixdeinterleave' and 'pixinterleave' macros: >> >> https://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.h?id=pixman-0.34.0#n347 >> >> For the sake of allowing better instructions scheduling, we may >> split them into 4 separate macros (and abstract the 32-bit/64-bit >> differences) via something like this: >> pixdeinterleave_step1 ... >> pixdeinterleave_step2 ... >> pixdeinterleave_step3 ... >> pixdeinterleave_step4 ... >> Then these different steps can be interleaved with other >> instructions in the code. >> >> The structure load/store instructions in AArch64 handle low >> 64-bit register halves (even numbered Dn registers). And in >> 32-bit mode, the Dn registers are contiguous. This introduces >> the need to re-allocate the registers a bit between the >> 32-bit and 64-bit modes. This is actually the biggest challenge >> for the unified syntax converter. If I can get this resolved >> in a simple and clean way, then the job is basically done. >> >> >>> * assumes integer register is 32-bit width implicitly >> >> All the integer registers in the NEON assembly code have named aliases. >> So the 32-bit and 64-bit differences are already fully abstracted. We >> have already solved this particular problem long before the AArch64 >> even came to existence :-) >> >>> Once the script run correctly and generate something, I have to modify >>> all of the result. >>> So I concluded asm sources should be separated. >> >> Yes, if fully automated conversion does not work yet, then at least >> partially automated conversion makes sense. I have noticed that even >> the code enclosed under "#if 0" has also changed, so I already >> suspected that you used some scripts. >> >>> And I'm also afraid that, to generate aarch64 codes, original >>> pixman-arm-neon-asm*.S >>> may be modified, and it cause regressions to existing aarch32 worlds. >> >> Yes, the changes may be needed. But I don't expect them to be too >> disruptive. >> >>> The other point of view, >>> aarch32-neon is "fixed" because ARM won't add new instructions for aarch32 >>> and existing pixman-arm-neon-asm*.S would be completed and stable. So we >>> should >>> leave it as is. >>> Then aarch64 may continue to add new features and we should follow >>> them. So I think >>> aarch32-neon and aarch64-neon should be separated. >>> (Cortex-A32 ? Hmm...) >> >> There are still a lot of 32-bit ARM processors in use. And the >> current 32-bit assembly code is far from complete. We can still >> improve it. And will probably do this. >> >>> > So it looks like the prefetch either needs to be dropped in >>> > the 64-bit code, or we can still experiment with other prefetch >>> > strategies a bit more and find something that is more suitable. >>> >>> It's a point that I cannot decide that, should we use prefetch (none >>> or advance), >>> which memory to use (L1, L2, L3). Can we use L1 for this purpose ? >>> All of aarch64 have L2 cache ? >>> I think it should be configurable for each target CPUs. >> >> The pldl2strm prefetch modifier appears to be harmful for Cortex-A53. >> The pldl1keep works better. You can try to run >> >> https://github.com/ssvb/tinymembench >> >> on your device to get some memory related benchmark numbers (and >> probably share this information with us). On the PINE64 board I >> get this: >> >> https://github.com/ssvb/tinymembench/wiki/PINE64-(Allwinner-A64) >> >> The benchmark should show which prefetch method works better. I have >> tried to design the code so that we can support different prefetch >> methods relatively easily. The advanced prefetch is probably a bad >> fit for AArch64 because the condition execution is not supported >> anymore. We can also replace the 'advanced' prefetch with some other >> implementation on AArch64. You can try the simple-distance-ahead >> and same-pixel-on-next-scanline prefetch strategies. >> >>> > + umlsl v0.4s, v2.4h, v15.h[0] >>> > + umlal2 v0.4s, v2.8h, v15.h[0] >>> >>> !!! >>> I don't know the syntax. Yes, it can reduce many unnecessaly register >>> movings. >> >> The ARMv8 architecture manual has documentation for these instructions. >> As for the GNU assembler syntax, I had to look into binutils sources in >> order to figure out what would be the appropriate form. >> >>> > Moreover, this code can be the same in both 32-bit and 64-bit >>> > code. I have just submitted a patch for the 32-bit NEON code, >>> > which can reduce the difference between the 32-bit and 64-bit >>> > code: >>> > https://lists.freedesktop.org/archives/pixman/2016-April/004489.html >>> >>> Yes, these "common" and "no-side-effect (to aarch32)" modification would be >>> OK. >> >> Thanks! Can I have your Acked-by: or Reviewed-by: for this patch? >> >>> But, I'm afraid again, a patch to gain aarch64 performance, make regression >>> for aarch32 must not be applied. >> >> Don't worry, we are not going to regress the AArch32 support. >> >>> > I would suggest to first take the 'pixman-arma64-neon-asm-bilinear.S' >>> > file out of your initial patch submission in order to reduce its size >>> > and make the review process more manageable. Such reduced patch can be >>> > probably small enough to reach the mailing list. >>> >>> OK. I agree. >>> At first, we should start minimal set of codes. >> >> Well, I don't want to make it more difficult than necessary. Submitting >> the whole pixman-arm-neon-asm.S file in one go is fine for me. You can >> squash my patches into your code and simply mention this in the commit >> message with a reference to the mailing list archive. You should >> probably also add your own copyright notice to the new assembly files. >> >> If you have difficulties getting rid of some redundant instructions, >> just ask and I will try to help. We really want to have a performance >> parity between 32-bit and 64-bit assembly code. Because this is a >> deciding factor for people when they are considering to switch from >> the 32-bit to 64-bit userland. The Raspberry Pi 3 is yet to make this >> decision. And I don't want to see people starting to make claims that >> the 64-bit mode is slow :-) >> >> If we don't get rid of the redundant instructions right now, then >> 32-bit and 64-bit assembly sources may diverge and it will become >> much more difficult to do this job later. >> >>> I'll also planned to disable "nearest" codes, is it OK ? (or anyone use >>> it ? How about "0565" ?) >> >> It is a part of the functionality, provided by the pixman API. There is >> no reason to drop nearest scaling or rgb565 support. I would say that we >> should still have these optimizations available for AArch64. >> >> As I said above, if some of the rgb565 code is difficult to make as >> efficient as the 32-bit variant, then I can try to help. Just fix >> whatever you can. >> >>> > I also see a lot of label renames in your 64-bit code. Were all of them >>> > really necessary? Maybe we should also rename the labels in the 32-bit >>> > assembly source file if such renames are justified? >>> >>> I've met some compiler (gnu as) bug. >>> In nested macro, "gnu-as" failed to deternine the label and it causes >>> infinite loop. >> >> It might be nice to report this bug to binutils (if you haven't done >> this yet). >> >>> Since aarch64 have removed many "conditional" syntax, I have to add many >>> branch instructions especially for prefetching. So the bug occurs (or >>> I've failed >>> to follow the gnu-as rule). >>> I don't want to be annoyed such difficult-to-find behavior, all label names >>> were >>> renamed not to cause confliction. >> >> We still can do the same labels rename in the 32-bit assembly source. >> This will allow us to reduce the difference between 32-bit and 64-bit >> assembly sources. >> >>> Anyway, I'll post new (simplified) patch again. >> >> Thanks. But maybe first try to remove as many redundant instructions >> as possible. You can create a repository at github and temporarily >> push the preliminary fixes there. Send a cleaned version to the mailing >> list when you think that it is ready or substantially improved. >> >> You can use the http://meldmerge.org/ tool to compare the differences >> between the 32-bit assembly source and the 64-bit assembly source on >> your computer locally. And also edit the source code directly in it. >> >> -- >> Best regards, >> Siarhei Siamashka _______________________________________________ Pixman mailing list Pixman@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/pixman