> OK, thanks for this information. What I mean is that this should be > preferably a separate patch.
I've sent it on the separated patch. On 3 April 2016 at 20:22, Siarhei Siamashka <siarhei.siamas...@gmail.com> wrote: > On Sat, 2 Apr 2016 20:28:44 +0900 > Mizuki Asakura <ed6e1...@gmail.com> wrote: > >> Thanks for your reply. >> >> > The main concern is whether we can get support for all these >> > assembly syntax flavours, calling conventions and minor instructions >> > differences, while keeping the code maintainable. >> >> In the aarch64 point of view, I believe we should only support ARM's original >> syntax and calling convensions. It would be most portable. > > The ILP32 is an alternative ABI, which is being endorsed or even > developed by ARM: > http://infocenter.arm.com/help/topic/com.arm.doc.dai0490a/index.html > > When/if somebody tries to use pixman in an AArch64 ILP32 system, this > may cause some difficulties/inconveniences for us. > >> In iOS case, they may be able to create some converter script that only >> translate instruction name to their own since the architcture is same >> (as opposed aarch32-neon vs aarch64-neon is (almost) same, but different). > > Yes. The converters exists, but AFAIK the pixman assembly code is > way too difficult for them to parse. Again, the problem is technically > solvable, just nobody has invested efforts into solving it yet. Either > by contributing the needed code or by sponsoring this work. > > I mentioned this only because the unified 32-bit/64-bit ARM assembly > preprocessor could also potentially take care of this iOS issue as a > side effect. > >> # How about SSE2 on OS X ? Is it same syntax and calling convension ? > > I'm not familiar with OS X myself. But the SSE2 code in pixman is > currently using intrinsics, so this makes everything quite simple > as far as portability is concerned. > > If we were to use hand tuned assembly optimizations for x86, then we > would have to take care of the 32-bit Linux/Windows, 64-bit Linux with > x32 ABI, 64-bit Linux and 64-bit Windows cases separately. Windows > arrived late to the x86-64 scene, but they still decided to invent > their own incompatible calling conventions. All of this is fine for > intrinsics, but may clearly cause problems for assembly implementations. > > Hand written assembly has a relatively high maintenance cost. But it > pays off if we want to have good performance. > > I still see a large potential for improving pixman performance on > x86 (replace intrinsics with hand written assembly and make a better > use of SSSE3, AVX, etc.). But somebody has to shoulder the expenses > for this to happen. BTW, I would be happy to take this task if > somebody is willing to pay :-) > >> > /* stride is always multiple of 32bit units in pixman */ >> > - uint32_t byte_stride = stride * sizeof(uint32_t); >> > + int32_t byte_stride = stride * sizeof(uint32_t); >> >> It caused horribly crashes in aarch64. Why it works on aarch32 ? :) >> (32-bit integer overflow may help them...) > > OK, thanks for this information. What I mean is that this should be > preferably a separate patch. > >> > I'm currently investigating the possibility of maybe "inventing" a >> > common unified syntax for some preprocessor script to take care of >> > both 32-bit and 64-bit ARM code. But we still can't be sure if this >> > approach is practical (or even doable) and whether it can solve more >> > problems than introduce new ones... >> >> At first, I've also tried to create a "converter script" from >> pixman-arm-neon-asm*.S to >> aarch64 neon codes just translating instruction, register names and so on. > > I see. It is nice to know that we basically came up with the same > solution at least as the starting point :-) > >> But I've found that this script won't work on pixman's code because of: >> >> * conflicting register names: both Qn and Dn are used in same place >> * highly optimized for aarch32-neon register structure. assumes d29, >> d30 is low / high of q15 > > It's all mostly just a syntax sugar. The Dn registers still do exist > in a somewhat hidden way. For example: > > vshrn.u32 d0, q15, #1 > vshrn.u32 d1, q15, #1 > > directly translates into > > shrn v0.4h, v15.4s, #1 > shrn2 v0.8h, v15.4s, #1 > > The selection of SHRN or SHRN2 instruction variant decides whether > the lower or upper half of the 128-bit destination register is updated. > > The 64-bit source operands (Dn registers) are handled in the very > same way: > > vmull.u16 q1, d0, d0 > vmull.u16 q2, d1, d1 > > directly translates into > > umull v1.4s, v0.4h, v0.4h > umull2 v2.4s, v0.8h, v0.8h > > There are of course some limitations. Mixing odd and even > numbered d-registers as the source operands is not supported > in AArch64. This is a bit inconvenient, but the AArch64 code > still needs to take this into account either way regardless > of the syntax. > > I'm considering to add something like the following 'umull12' > fake instruction: > > umull12 v1.4s, v0.4h, v0.8h > > which directly translates into > > vmull.u16 q1, d0, d1 > > So that we can write 32-bit assembly code in the AArch64 style. > And use then use conditional compilation (#ifdef __aarch64__) > for a few parts of code where we really need to have separate > code paths for 32-bit and 64-bit mode. > > And such fake instructions are easy to grep in the source code > if we do automatic 32-bit -> 64-bit conversion. > > > The most challenging differences are the VZIP/VUZP instructions > and also structure loads. > > Frankly speaking, I like the VZIP/VUZP behaviour change in AArch64. > The 32-bit way of handling them was a 'destructive' update of the > registers in-place. And for software pipelining we want to have > a 'non-destructive' form to move the intermediate results between > registers instead of updating them in-place. Fortunately pixman > does not use these instructions too much. They are just used for > conversion between 'planar' and 'packed' form via a sequence of > 4 instructions as explained at: > > https://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.S?id=pixman-0.34.0#n128 > > This sequence of 4 instructions needs an AArch64 equivalent. Now we > have macros 'pixdeinterleave' and 'pixinterleave' macros: > > https://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.h?id=pixman-0.34.0#n347 > > For the sake of allowing better instructions scheduling, we may > split them into 4 separate macros (and abstract the 32-bit/64-bit > differences) via something like this: > pixdeinterleave_step1 ... > pixdeinterleave_step2 ... > pixdeinterleave_step3 ... > pixdeinterleave_step4 ... > Then these different steps can be interleaved with other > instructions in the code. > > The structure load/store instructions in AArch64 handle low > 64-bit register halves (even numbered Dn registers). And in > 32-bit mode, the Dn registers are contiguous. This introduces > the need to re-allocate the registers a bit between the > 32-bit and 64-bit modes. This is actually the biggest challenge > for the unified syntax converter. If I can get this resolved > in a simple and clean way, then the job is basically done. > > >> * assumes integer register is 32-bit width implicitly > > All the integer registers in the NEON assembly code have named aliases. > So the 32-bit and 64-bit differences are already fully abstracted. We > have already solved this particular problem long before the AArch64 > even came to existence :-) > >> Once the script run correctly and generate something, I have to modify >> all of the result. >> So I concluded asm sources should be separated. > > Yes, if fully automated conversion does not work yet, then at least > partially automated conversion makes sense. I have noticed that even > the code enclosed under "#if 0" has also changed, so I already > suspected that you used some scripts. > >> And I'm also afraid that, to generate aarch64 codes, original >> pixman-arm-neon-asm*.S >> may be modified, and it cause regressions to existing aarch32 worlds. > > Yes, the changes may be needed. But I don't expect them to be too > disruptive. > >> The other point of view, >> aarch32-neon is "fixed" because ARM won't add new instructions for aarch32 >> and existing pixman-arm-neon-asm*.S would be completed and stable. So we >> should >> leave it as is. >> Then aarch64 may continue to add new features and we should follow >> them. So I think >> aarch32-neon and aarch64-neon should be separated. >> (Cortex-A32 ? Hmm...) > > There are still a lot of 32-bit ARM processors in use. And the > current 32-bit assembly code is far from complete. We can still > improve it. And will probably do this. > >> > So it looks like the prefetch either needs to be dropped in >> > the 64-bit code, or we can still experiment with other prefetch >> > strategies a bit more and find something that is more suitable. >> >> It's a point that I cannot decide that, should we use prefetch (none >> or advance), >> which memory to use (L1, L2, L3). Can we use L1 for this purpose ? >> All of aarch64 have L2 cache ? >> I think it should be configurable for each target CPUs. > > The pldl2strm prefetch modifier appears to be harmful for Cortex-A53. > The pldl1keep works better. You can try to run > > https://github.com/ssvb/tinymembench > > on your device to get some memory related benchmark numbers (and > probably share this information with us). On the PINE64 board I > get this: > > https://github.com/ssvb/tinymembench/wiki/PINE64-(Allwinner-A64) > > The benchmark should show which prefetch method works better. I have > tried to design the code so that we can support different prefetch > methods relatively easily. The advanced prefetch is probably a bad > fit for AArch64 because the condition execution is not supported > anymore. We can also replace the 'advanced' prefetch with some other > implementation on AArch64. You can try the simple-distance-ahead > and same-pixel-on-next-scanline prefetch strategies. > >> > + umlsl v0.4s, v2.4h, v15.h[0] >> > + umlal2 v0.4s, v2.8h, v15.h[0] >> >> !!! >> I don't know the syntax. Yes, it can reduce many unnecessaly register >> movings. > > The ARMv8 architecture manual has documentation for these instructions. > As for the GNU assembler syntax, I had to look into binutils sources in > order to figure out what would be the appropriate form. > >> > Moreover, this code can be the same in both 32-bit and 64-bit >> > code. I have just submitted a patch for the 32-bit NEON code, >> > which can reduce the difference between the 32-bit and 64-bit >> > code: >> > https://lists.freedesktop.org/archives/pixman/2016-April/004489.html >> >> Yes, these "common" and "no-side-effect (to aarch32)" modification would be >> OK. > > Thanks! Can I have your Acked-by: or Reviewed-by: for this patch? > >> But, I'm afraid again, a patch to gain aarch64 performance, make regression >> for aarch32 must not be applied. > > Don't worry, we are not going to regress the AArch32 support. > >> > I would suggest to first take the 'pixman-arma64-neon-asm-bilinear.S' >> > file out of your initial patch submission in order to reduce its size >> > and make the review process more manageable. Such reduced patch can be >> > probably small enough to reach the mailing list. >> >> OK. I agree. >> At first, we should start minimal set of codes. > > Well, I don't want to make it more difficult than necessary. Submitting > the whole pixman-arm-neon-asm.S file in one go is fine for me. You can > squash my patches into your code and simply mention this in the commit > message with a reference to the mailing list archive. You should > probably also add your own copyright notice to the new assembly files. > > If you have difficulties getting rid of some redundant instructions, > just ask and I will try to help. We really want to have a performance > parity between 32-bit and 64-bit assembly code. Because this is a > deciding factor for people when they are considering to switch from > the 32-bit to 64-bit userland. The Raspberry Pi 3 is yet to make this > decision. And I don't want to see people starting to make claims that > the 64-bit mode is slow :-) > > If we don't get rid of the redundant instructions right now, then > 32-bit and 64-bit assembly sources may diverge and it will become > much more difficult to do this job later. > >> I'll also planned to disable "nearest" codes, is it OK ? (or anyone use >> it ? How about "0565" ?) > > It is a part of the functionality, provided by the pixman API. There is > no reason to drop nearest scaling or rgb565 support. I would say that we > should still have these optimizations available for AArch64. > > As I said above, if some of the rgb565 code is difficult to make as > efficient as the 32-bit variant, then I can try to help. Just fix > whatever you can. > >> > I also see a lot of label renames in your 64-bit code. Were all of them >> > really necessary? Maybe we should also rename the labels in the 32-bit >> > assembly source file if such renames are justified? >> >> I've met some compiler (gnu as) bug. >> In nested macro, "gnu-as" failed to deternine the label and it causes >> infinite loop. > > It might be nice to report this bug to binutils (if you haven't done > this yet). > >> Since aarch64 have removed many "conditional" syntax, I have to add many >> branch instructions especially for prefetching. So the bug occurs (or >> I've failed >> to follow the gnu-as rule). >> I don't want to be annoyed such difficult-to-find behavior, all label names >> were >> renamed not to cause confliction. > > We still can do the same labels rename in the 32-bit assembly source. > This will allow us to reduce the difference between 32-bit and 64-bit > assembly sources. > >> Anyway, I'll post new (simplified) patch again. > > Thanks. But maybe first try to remove as many redundant instructions > as possible. You can create a repository at github and temporarily > push the preliminary fixes there. Send a cleaned version to the mailing > list when you think that it is ready or substantially improved. > > You can use the http://meldmerge.org/ tool to compare the differences > between the 32-bit assembly source and the 64-bit assembly source on > your computer locally. And also edit the source code directly in it. > > -- > Best regards, > Siarhei Siamashka _______________________________________________ Pixman mailing list Pixman@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/pixman