On 13.03.2012 20:07, Stroller wrote:
> 
> On 13 March 2012, at 18:18, Michael Mol wrote:
>> ...
>>> So I assume the i586 version is better for you --- unless GCC
>>> suddenly got a lot better at optimizing code.
>> 
>> Since when, exactly? GCC isn't the best compiler at optimization,
>> but I fully expect current versions to produce better code for
>> x86-64 than hand-tuned i586. Wider registers, more registers,
>> crypto acceleration instructions and SIMD instructions are all very
>> nice to have. I don't know the specifics of AES, though, or what
>> kind of crypto algorithm it is, so it's entirely possible that one
>> can't effectively parallelize it except in some relatively unique
>> circumstances.
> 
> Do you have much experience of writing assembler?
> 
> I don't, and I'm not an expert on this, but I've read the odd blog
> article on this subject over the years.
> 
> What I've read often has the programmer looking at gcc's compiled
> assembly output and examining what it does. The compiler might not
> care how many registers it uses, and thus a variable might find
> itself frequently swapped back into RAM; the programmer does not have
> any control over the compiler, and IIRC some flags reserve a register
> for debugging (IIRC -fomit-frame-pointer disables this). I think it's
> possible to use registers more efficiently by swapping them (??) or
> by using bitwise comparisons and other tricks.
> 

You recall correctly about the frame pointer.

Concerning the register usage: I'm no expert in this field, either, but
I think the main issue is not simply register allocation but branch
prediction and the like. The compiler can optimize for a seamless
continuation either when the jump is taken or when it isn't, but it has
to commit to one of the two. A human -- or a just-in-time compiler --
can handle these cases better by predicting the outcome or, in the case
of a JIT, by analyzing the outcome of the first few iterations.

OT: IIRC, register reuse is also the main performance problem of
state-of-the-art javascript engines at the moment. The code they
compile at runtime is nearly as good as `gcc -O0`, but they share its
problem concerning registers: GCC with -O0 produces code that works
exactly as you describe above, storing the result after every
computation and loading it again.

> Assembler optimisation is only used on sections of code that are at
> the core of a loop - that are called hundreds or thousands (even
> millions?) of times during the program's execution. It's not for
> code, such as reading the .config file or initialisation, which is
> only called once. Because the code in the core of the loop is called
> so often, you don't have to achieve much of an optimisation for the
> aggregate saving to be considerable.
> 
> The operations in question may constitute only a few lines of C,
> or a handful of machine operations, so it boils down to an algorithm
> that a human programmer is capable of getting a grip on and
> comprehending. Whilst compilers are clearly more efficient for large
> programs, on this micro scale, humans are more clever and creative
> than machines.
> 
> Encryption / decryption is an example of code that lends itself to
> this kind of optimisation. In particular AES was designed, I believe,
> to be amenable to implementation in this way. The reason for that was
> that it was desirable to have it run on embedded devices and on
> dedicated chips. So it boils down to a simple bitswap operation (??)
> - the plaintext is modified by the encryption key, input and output
> as a fast stream. Each byte goes in, each byte goes out, the same
> function performed on each one.
> 

Well, sort of. First off, you are right: AES was designed with hardware
implementations in mind.

The algorithm boils down to a number of substitution-permutation
networks and XOR operations (I assume that's what you meant by
"bitswap"). If you look at the portable C code
(/usr/src/linux/crypto/aes_generic.c), you can see that it mostly
consists of lookup tables and XORs.

The thing about "each byte goes in, each byte goes out", however, is a
bit wrong. What you are thinking of is a stream cipher like RC4. AES is
a block cipher: it takes a fixed-size input block (128 bits in this
case), XORs it with the encryption (sub-)key, and shuffles it around
according to the exact algorithm.

> Another operation that lends itself to assembler optimisation is
> video decoding - the video is encoded only once, and then may be
> played back hundreds or millions of times by different people. The
> same operations must be repeated a number of times on each frame,
> then c 25 - 60 frames are decoded per second, so at least 90,000
> frames per hour. Again, the smallest optimisation is worthwhile.
> 
> Stroller.
> 
> 

