Am 13.03.2012 20:38, schrieb Michael Mol:
> On Tue, Mar 13, 2012 at 3:07 PM, Stroller 
> <strol...@stellar.eclipse.co.uk> wrote:
>> 
>> On 13 March 2012, at 18:18, Michael Mol wrote:
>>> ...
>>>> So I assume the i586 version is better for you --- unless GCC
>>>> suddenly got a lot better at optimizing code.
>>> 
>>> Since when, exactly? GCC isn't the best compiler at optimization,
>>> but I fully expect current versions to produce better code for
>>> x86-64 than hand-tuned i586. Wider registers, more registers,
>>> crypto acceleration instructions and SIMD instructions are all
>>> very nice to have. I don't know the specifics of AES, though, or
>>> what kind of crypto algorithm it is, so it's entirely possible
>>> that one can't effectively parallelize it except in some
>>> relatively unique circumstances.
>> 
>> Do you have much experience of writing assembler?
>> 
>> I don't, and I'm not an expert on this, but I've read the odd blog
>> article on this subject over the years.
> 
> Similar level of experience here. I can read it, even debug it from 
> time to time. A few regular bloggers on the subject are like candy. 
> And I used to have pagetable.org, Ars's Technopaedia and spec sheets 
> for early x86 and Motorola processors memorized. For the past couple 
> of years, I've been focusing on reading the blogs of language and 
> compiler authors, academics involved in proving, testing and 
> improving them, etc.
> 
>> 
>> What I've read often has the programmer looking at the assembly
>> gcc generates and examining what it does. The compiler might not
>> care how many registers it uses, and thus a variable might find
>> itself frequently swapped back into RAM; the programmer does not
>> have any control over the compiler, and IIRC some flags reserve a
>> register for debugging (IIRC -fomit-frame-pointer disables this). I
>> think it's possible to use registers more efficiently by swapping
>> them (??) or by using bitwise comparisons and other tricks.
> 
> Sure; it's cheaper to null out a register by XORing it with itself 
> than to load an immediate 0 into it.
> 
>> 
>> Assembler optimisation is only used on sections of code that are at
>> the core of a loop - that are called hundreds or thousands (even
>> millions?) of times during the program's execution. It's not for
>> code, such as reading the .config file or initialisation, which is
>> only called once. Because the code in the core of the loop is
>> called so often, you don't need to achieve much of an optimisation
>> for the aggregate saving to be considerable.
> 
> Sure; optimize the hell out of the code where you spend most of your 
> time. I wasn't aware that gcc passed up on safe optimization 
> opportunities, though.
> 
>> 
>> The operations in question may only constitute a few lines of C,
>> or a handful of machine operations, so it boils down to an
>> algorithm that a human programmer is capable of getting a grip on
>> and comprehending. Whilst compilers are clearly more efficient for
>> large programs, on this micro scale, humans are more clever and
>> creative than machines.
> 
> I disagree. With defined semantics for the source and target, a 
> computer's cleverness is limited only by the computational and
> memory expense of its search algorithms. Humans get through this by
> making a habit of various optimizations, but those habits become
> less useful as additional paths and instructions are added. As
> system complexity increases, humans operate on personally cached
> techniques derived from simpler systems. I would expect very, very
> few people to be intimately familiar with the majority of
> optimization possibilities present on an amdfam10 processor or a
> core2. Compilers aren't necessarily familiar with them, either;
> they're just quicker at discovering them, given knowledge of the
> individual instructions and the rules of language semantics.
> 
>> 
>> Encryption / decryption is an example of code that lends itself to
>> this kind of optimisation. In particular AES was designed, I
>> believe, to be amenable to implementation in this way. The reason
>> for that was that it was desirable to have it run on embedded
>> devices and on dedicated chips. So it boils down to a simple
>> bitswap operation (??) - the plaintext is modified by the
>> encryption key, input and output as a fast stream. Each byte goes
>> in, each byte goes out, the same function performed on each one.
> 
> I'd be willing to posit that you're right here, though if there
> isn't a per-byte feedback mechanism, SIMD instructions would come
> into serious play. But I expect there's a per-byte feedback
> mechanism, so parallelization would likely come in the form of
> processing simultaneous streams.
> 
>> 
>> Another operation that lends itself to assembler optimisation is
>> video decoding - the video is encoded only once, and then may be
>> played back hundreds or millions of times by different people. The
>> same operations must be repeated a number of times on each frame,
>> then circa 25-60 frames are decoded per second, so at least 90,000
>> frames per hour. Again, the smallest optimisation is worthwhile.
> 
> Absolutely. My position, though, is that compilers are quicker and 
> more capable of discovering optimization possibilities than humans 
> are, when the target architecture changes. Sure, you've got several 
> dozen video codecs in, say, ffmpeg, and perhaps it all boils down to 
> less than a dozen very common cases of inner loop code. With 
> hand-tuned optimization, you'd need to fork your assembly patch for 
> each new processor feature that comes out, and then work to find the 
> most efficient way to execute code on that processor.
> 
> There are also cases where processor features get changed. I don't 
> remember the name of the instruction (it had something to do with 
> stack operations) in x86, but Intel switched it from a 0-cycle 
> instruction to something more expensive. Any code which assumed that 
> instruction was a 0-cycle instruction now became less efficient. A 
> compiler (presuming it has a knowledge of the target processor's 
> instruction set properties) would have an easier time coping with
> that change than a human would.
> 
> I'm not saying humans are useless; this is just one of those areas 
> which is sufficiently complex-yet-deterministic that sufficient 
> knowledge of the source and target environments would give a
> computer the edge over a human in finding the optimal sequence of
> CPU instructions.
> 

This thread is becoming ridiculously long. Just as a last side-note:

One of the primary reasons that the IA64 architecture failed was that it
relied on the compiler to optimize the code in order to exploit the
massive instruction-level parallelism the CPU offered. Compilers never
became good enough for the job. Of course, that happened in the
nineties and we have much better compilers now (and x86 is easier for
compilers to handle). But on the other hand: that was Intel's next big
thing, and if they couldn't make the compilers work then, I see no
reason to believe compilers are up to that job now.

Regards,
Florian Philipp
