On 13.03.2012 20:07, Stroller wrote:
>
> On 13 March 2012, at 18:18, Michael Mol wrote:
>> ...
>>> So I assume the i586 version is better for you --- unless GCC
>>> suddenly got a lot better at optimizing code.
>>
>> Since when, exactly? GCC isn't the best compiler at optimization,
>> but I fully expect current versions to produce better code for
>> x86-64 than hand-tuned i586. Wider registers, more registers,
>> crypto acceleration instructions and SIMD instructions are all very
>> nice to have. I don't know the specifics of AES, though, or what
>> kind of crypto algorithm it is, so it's entirely possible that one
>> can't effectively parallelize it except in some relatively unique
>> circumstances.
>
> Do you have much experience of writing assembler?
>
> I don't, and I'm not an expert on this, but I've read the odd blog
> article on this subject over the years.
>
> What I've read often has the programmer looking at the compiled gcc
> bytecode and examining what it does. The compiler might not care how
> many registers it uses, and thus a variable might find itself
> frequently swapped back into RAM; the programmer does not have any
> control over the compiler, and IIRC some flags reserve a register for
> debugging (IIRC -fomit-frame-pointer disables this). I think it's
> possible to use registers more efficiently by swapping them (??) or
> by using bitwise comparisons and other tricks.
>
You recall correctly about the frame pointer. Concerning register
usage: I'm no expert in this field either, but I think the main issue
is not simply register allocation but things like branch prediction.
The compiler can optimize either for a seamless continuation when the
jump is taken or for when it is not; a human -- or a just-in-time
compiler -- can handle these cases better by predicting the outcome
or, in the case of a JIT, by analyzing the outcome of the first few
iterations.

OT: IIRC, register reuse is also the main performance problem of
state-of-the-art JavaScript engines at the moment. Concerning the code
they compile at runtime, they are nearly as good as `gcc -O0`, but
they have the same problem concerning registers (GCC with -O0 produces
code that works exactly as you describe above: it stores the result
after every computation and loads it again).

> Assembler optimisation is only used on sections of code that are at
> the core of a loop - that are called hundreds or thousands (even
> millions?) of times during the program's execution. It's not for
> code, such as reading the .config file or initialisation, which is
> only called once. Because the code in the core of the loop is called
> so often, you don't have to achieve much of an optimisation for the
> aggregate gain to be considerable.
>
> The operations in question may only constitute a few lines of C,
> or a handful of machine operations, so it boils down to an algorithm
> that a human programmer is capable of getting a grip on and
> comprehending. Whilst compilers are clearly more efficient for large
> programs, on this micro scale, humans are more clever and creative
> than machines.
>
> Encryption / decryption is an example of code that lends itself to
> this kind of optimisation. In particular, AES was designed, I
> believe, to be amenable to implementation in this way. The reason
> for that was that it was desirable to have it run on embedded
> devices and on dedicated chips.
> So it boils down to a simple bitswap operation (??)
> - the plaintext is modified by the encryption key, input and output
> as a fast stream. Each byte goes in, each byte goes out, the same
> function performed on each one.
>

Well, sort of. First off, you are right: AES was designed with
hardware implementations in mind. The algorithm boils down to a number
of substitution-permutation networks and XOR operations (I assume
that's what you meant by bit swap). If you look at the portable C code
(/usr/src/linux/crypto/aes_generic.c), you can see that it mostly
consists of lookup tables and XORs.

The part about "each byte goes in, each byte goes out", however, is a
bit wrong. What you are thinking of is a stream cipher like RC4. AES
is a block cipher: it takes an input block (in this case 128 bits
long), XORs it with the encryption (sub-)key and shuffles it around
according to the exact algorithm.

> Another operation that lends itself to assembler optimisation is
> video decoding - the video is encoded only once, and then may be
> played back hundreds or millions of times by different people. The
> same operations must be repeated a number of times on each frame,
> then c. 25 - 60 frames are decoded per second, so at least 90,000
> frames per hour. Again, the smallest optimisation is worthwhile.
>
> Stroller.
>
>