Re: Basecase assembly optimisation project
ni...@lysator.liu.se (Niels Möller) writes: Another feature, which I looked into while ago without getting very far with the loopmixer, is to make it understand associativity. I.e, try reordering certain instructions with the same destination register, like xor %r8, %rax xor %r9, %rax or add %r8, %rax add %r9, %rax This is possible if nothing else depends on the intermediate results (including no dependencies on status flags). Probably more useful for cryptographic loops than for arithmetic. I think this couldd be very useful also for the critical multiply- accumulate loops of GMP. The schema loops I create for the loopmixer try many accumulation strategies. This is a lot of work, but I only in practice try regular variants, while an automatic random-driven re-association feature could try beneficial irregular variants too. -- Torbjörn ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: Basecase assembly optimisation project
Ondřej Bílka nel...@seznam.cz writes: It is possible enchancement, but I am not yet at stage of calculating register dependencies on jumps. That's someting we do, but we only handle a simple jump-back for the loop. (That branch limitation is a slght problem for some division loops, which use an unlikely adjust branch.) -- Torbjörn ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: Basecase assembly optimisation project
Ondřej Bílka nel...@seznam.cz writes: I am writing a tool that might be useful, a simple optimizer of assembly routines. You need to write a benchmark that measures performance and prints elapsed time and assembly file. Currently it has two optimization patterns, first is enclosing block of code without control flow instructions in # EVOLVE_START code # EVOLVE_END and evolver will try different schedulings to find optimal one. Second one is when you choose between different instructions, where you write alternatives by following form. cmp $1, %rax #| cmp $1, %al A sequences can be done in following way cmp $1, %rax; ja .foo #| test %eax, %eax; jne .foo sounds interesting? It sounds a lot like David Harvey's and my tool, which we call the loopmixer. We never wrote it neatly enough for public release, but we've used it for the last 4 years or so for tweaking the assembly code of GMP. (There is a slightly cryptic message in many asm files in GMP about this.) We don't handle alternatives currently, except with a loop around the loopmixer. One could think of several classes, some known to the tool, others not-always-valid only by source file annotation. Examples of the former would be xor rax,rax vs xor eax,eax and mov reg,reg vs lea (reg),reg. For x86-64, CPUs are affected more or less by which register is used, which could be understood by the tool. (E.g., rbp and r13 lack he shorter 0-offset address for; rsp and r12 require an extra 0x24 when used as base ptrs; r8-r15 require REX prefix.) Taking all this into account might through us out in the land of combinatorial explosion. Keeping software pipeline depth down would also be useful. Our tool doesn't understand that. When one give instruction choices, manual preferences and automatic preferences would be useful. Automatic preferences could be to use a short insn, manual could be to prefer mov over lea to make code more readable. (Not that mov is much more readable, but to avoid a random mix of he two in similar contexts.) -- Torbjörn ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel
Re: Basecase assembly optimisation project
On Wed, Oct 02, 2013 at 08:19:29PM +0200, Niels Möller wrote: Torbjorn Granlund t...@gmplib.org writes: We don't handle alternatives currently, except with a loop around the loopmixer. One could think of several classes, some known to the tool, others not-always-valid only by source file annotation. Examples of the former would be xor rax,rax vs xor eax,eax and mov reg,reg vs lea (reg),reg. Another feature, which I looked into while ago without getting very far with the loopmixer, is to make it understand associativity. I.e, try reordering certain instructions with the same destination register, like xor %r8, %rax xor %r9, %rax or add %r8, %rax add %r9, %rax This is possible if nothing else depends on the intermediate results (including no dependencies on status flags). This is relatively easy to add, my basic block is swap two adjacent instructions and I have list of pairs that are possible like addq addq addq subq pminub pminub ... There is one feature that I plan to add which is recalculating addresses like transforming between addq $64, %rdi movdqa (%rdi), %xmm0 and movdqa 64(%rdi), %xmm0 addq $64, %rdi In mine program is programmer responsibility to annotate parts without control flow. Probably more useful for cryptographic loops than for arithmetic. With vector loops this is also usefull. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel