Re: Basecase assembly optimisation project

2013-10-03 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Another feature, which I looked into while ago without getting very far
  with the loopmixer, is to make it understand associativity. I.e, try
  reordering certain instructions with the same destination register, like
  
xor %r8, %rax
xor %r9, %rax
  
  or
  
add %r8, %rax
add %r9, %rax

  This is possible if nothing else depends on the intermediate results
  (including no dependencies on status flags).
  
  Probably more useful for cryptographic loops than for arithmetic.
  
I think this couldd be very useful also for the critical multiply-
accumulate loops of GMP.  The schema loops I create for the loopmixer
try many accumulation strategies.  This is a lot of work, but I only in
practice try regular variants, while an automatic random-driven
re-association feature could try beneficial irregular variants too.

-- 
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


Re: Basecase assembly optimisation project

2013-10-03 Thread Torbjorn Granlund
Ondřej Bílka nel...@seznam.cz writes:

  It is possible enchancement, but I am not yet at stage of calculating
  register dependencies on jumps.
  
That's someting we do, but we only handle a simple jump-back for the
loop.

(That branch limitation is a slght problem for some division loops,
which use an unlikely adjust branch.)

-- 
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


Re: Basecase assembly optimisation project

2013-10-02 Thread Torbjorn Granlund
Ondřej Bílka nel...@seznam.cz writes:

  I am writing a tool that might be useful, a simple optimizer of assembly
  routines. You need to write a benchmark that measures performance and
  prints elapsed time and assembly file. Currently it has two optimization
  patterns, first is enclosing block of code without control flow
  instructions in
  
  # EVOLVE_START
  code
  # EVOLVE_END
  
  and evolver will try different schedulings to find optimal one.
  
  Second one is when you choose between different instructions, where you 
  write alternatives by following form.
  
  cmp $1, %rax #| cmp $1, %al
  
  A sequences can be done in following way
  
  cmp $1, %rax; ja .foo #| test %eax, %eax; jne .foo
  
  sounds interesting?
  
It sounds a lot like David Harvey's and my tool, which we call the
loopmixer.  We never wrote it neatly enough for public release, but
we've used it for the last 4 years or so for tweaking the assembly code
of GMP.  (There is a slightly cryptic message in many asm files in GMP
about this.)

We don't handle alternatives currently, except with a loop around the
loopmixer.  One could think of several classes, some known to the tool,
others not-always-valid only by source file annotation.  Examples of the
former would be xor rax,rax vs xor eax,eax and mov reg,reg vs lea
(reg),reg.

For x86-64, CPUs are affected more or less by which register is used,
which could be understood by the tool.  (E.g., rbp and r13 lack he
shorter 0-offset address for; rsp and r12 require an extra 0x24 when
used as base ptrs; r8-r15 require REX prefix.)

Taking all this into account might through us out in the land of
combinatorial explosion.

Keeping software pipeline depth down would also be useful.  Our tool
doesn't understand that.  When one give instruction choices, manual
preferences and automatic preferences would be useful.  Automatic
preferences could be to use a short insn, manual could be to prefer mov
over lea to make code more readable.  (Not that mov is much more
readable, but to avoid a random mix of he two in similar contexts.)

-- 
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


Re: Basecase assembly optimisation project

2013-10-02 Thread Ondřej Bílka
On Wed, Oct 02, 2013 at 08:19:29PM +0200, Niels Möller wrote:
 Torbjorn Granlund t...@gmplib.org writes:
 
  We don't handle alternatives currently, except with a loop around the
  loopmixer.  One could think of several classes, some known to the tool,
  others not-always-valid only by source file annotation.  Examples of the
  former would be xor rax,rax vs xor eax,eax and mov reg,reg vs lea
  (reg),reg.
 
 Another feature, which I looked into while ago without getting very far
 with the loopmixer, is to make it understand associativity. I.e, try
 reordering certain instructions with the same destination register, like
 
   xor %r8, %rax
   xor %r9, %rax
 
 or
 
   add %r8, %rax
   add %r9, %rax
   
 This is possible if nothing else depends on the intermediate results
 (including no dependencies on status flags).

This is relatively easy to add, my basic block is swap two adjacent
instructions and I have list of pairs that are possible like
addq addq
addq subq
pminub pminub
...

There is one feature that I plan to add which is recalculating addresses
like transforming between

addq $64, %rdi
movdqa (%rdi), %xmm0
and
movdqa 64(%rdi), %xmm0
addq $64, %rdi

In mine program is programmer responsibility to annotate parts 
without control flow.

 
 Probably more useful for cryptographic loops than for arithmetic.

With vector loops this is also usefull.
 
 Regards,
 /Niels
 
 -- 
 Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
 Internet email is subject to wholesale government surveillance.

___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel