On 20/06/12 14:51, Manu wrote:
On 20 June 2012 14:44, Don Clugston <d...@nospam.com> wrote:

    On 20/06/12 13:04, Manu wrote:

        On 20 June 2012 13:51, Don Clugston <d...@nospam.com> wrote:

            On 19/06/12 20:19, Iain Buclaw wrote:

                Hi,

                Had round one of the code review process, so I'm going to
                post the main issues here that most affect D users / the
                platforms they want to run on / the compiler version they
                want to use.



                1) D Inline Asm and naked function support is raising far
                too many alarm bells. So it would just be easier to remove
                it and avoid all the other comments on why we need
                middle-end and backend headers in gdc.


            You seem to be conflating a couple of unrelated issues here.
            One is the calling convention. The other is inline asm.

            Comments in the thread about "asm is mostly used for short
            things which get inlined" leave me completely baffled, as they
            are completely wrong.

            There are two uses for asm, and they are very different:
            (1) Functionality. This happens when there are gaps in the
            language, and you get an abstraction inversion. You can
            address these with intrinsics.
            (2) Speed. High-speed, all-asm functions. These _always_
            include a loop.
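
            For (1), a sketch of what I mean (core.bitop.bsr is a real
            druntime intrinsic; the asm version is illustrative, assuming
            32-bit x86 and DMD's inline assembler):

                import core.bitop : bsr;

                // Without the intrinsic, a single BSR instruction needs
                // an asm block, which the compiler treats as opaque:
                uint highestBitAsm(uint v)
                {
                    asm
                    {
                        mov EAX, v;
                        bsr EAX, EAX;   // result is returned in EAX
                    }
                }

                // With the intrinsic, the compiler emits the instruction
                // itself and can still inline and optimise around it:
                uint highestBit(uint v)
                {
                    return bsr(v);
                }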


            You seem to be focusing on (1), but case (2) is completely
            different.

            Case (2) cannot be replaced with intrinsics. For example, you
            can't write asm code using MSVC intrinsics (because the
            compiler rewrites your code).
            Currently, D is the best way to write (2). It is much, much
            better than an external assembler.
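
            A minimal sketch of what I mean by (2), assuming 32-bit x86
            and DMD's inline assembler (register choice is illustrative):

                // Sums n uints starting at p. The whole function is asm,
                // and it contains a loop.
                uint sumAsm(const(uint)* p, size_t n)
                {
                    asm
                    {
                        push ESI;       // preserve callee-saved register
                        mov ESI, p;
                        mov ECX, n;
                        xor EAX, EAX;   // running total
                        test ECX, ECX;
                        jz done;
                    next:
                        add EAX, [ESI];
                        add ESI, 4;
                        dec ECX;
                        jnz next;
                    done:
                        pop ESI;        // result is returned in EAX
                    }
                }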


        Case 1 has no alternative to inline asm. I've thrown out some
        crazy ideas to think about (but nobody seems to like them). I
        still think it could be addressed though.

        Case 2: I'm not convinced. Such long functions are the type I'm
        generally interested in as well, and have the most experience
        with. But in my experience, they're almost always best written
        with intrinsics. If they're small enough to be inlined, then you
        can't afford not to use intrinsics. If they are truly big
        functions, then you begin to sacrifice readability and
        maintainability, and certainly limit the number of programmers
        that can maintain the code.
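
        For example (a sketch using D's core.simd; assumes an
        SSE-capable target), a function like this has to use intrinsics,
        because an asm block would defeat inlining entirely:

            import core.simd;

            // A small, hot function that must be inlinable. Written
            // with vector types, the compiler can inline it and
            // schedule the SSE ops together with the caller's code.
            float4 lerp4(float4 a, float4 b, float4 t)
            {
                return a + (b - a) * t;
            }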


    I don't agree with that. In the situations I'm used to, using
    intrinsics would not make it easier to read, and would definitely
    not make it easier to maintain. I find it inconceivable that
    somebody could understand the processor well enough to maintain the
    code, and yet not understand asm.


These functions of yours are 100% asm; that's not really what I would
usually call 'inline asm'. That's really just 'asm' :)
I think you've actually just illustrated one of my key points: you
can't just insert small inline asm blocks within regular code, because
the optimiser can't deal with them in most cases, so inevitably the
entire function becomes asm from start to end.

Personally I call it "inline asm" if I don't need to use a separate
assembler. If you're using a different definition, then we don't
actually disagree.


I find I can typically produce equivalent code using carefully crafted
intrinsics within regular C language structures. Also, often enough, the
code outside the hot loop can be written in normal C for readability,
since it barely affects performance, and trivial setup code will usually
optimise perfectly anyway.
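
Something like this pattern is what I mean (a sketch in D with
core.simd; assumes an SSE target and a 16-byte-aligned buffer, which
real code would have to check):

    import core.simd;

    // Setup and tail handling in ordinary code, vectors in the hot loop.
    float sum(const(float)[] a)
    {
        size_t nVec = a.length / 4;     // trivial setup: optimises fine
        float4 acc = 0;

        // the hot loop, written with vector operations
        auto v = cast(const(float4)*) a.ptr;
        foreach (i; 0 .. nVec)
            acc += v[i];

        // horizontal reduction and leftovers, again in plain code
        float total = acc.array[0] + acc.array[1]
                    + acc.array[2] + acc.array[3];
        foreach (x; a[nVec * 4 .. $])
            total += x;
        return total;
    }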

You're correct that a person 'maintaining' such code who doesn't have
such a thorough understanding of the codegen may ruin its perfectly
tuned efficiency. That may be the case, but in a commercial coding
environment, where a build MUST be delivered yesterday, the guy who
understands it is on holiday, and you need to tweak the behaviour
immediately, intrinsics are a much safer position to be in.
This is a very real scenario; I can't afford to ignore this practical
reality.

OK, it sounds like your use case is a bit different from the kinds of
things I deal with.

I might have a go at compiling the regular D code tonight, and seeing if
I can produce identical assembly. I haven't tried this so much with x86
as I have with RISC architectures, which have much more predictable codegen.


        I rarely fail to produce identical code with intrinsics to that
        which I would write with hand-written asm. The flags are always
        the biggest challenge, as discussed prior in this thread. I
        think that could be addressed with better intrinsics.


    Again, look at std.internal.math.BiguintX86. There are many cases
    there where you can swap two instructions, and the code will still
    produce the correct result, but it will be 30% slower.
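
    An illustrative sketch of the kind of loop I mean (not the actual
    BiguintX86 code; assumes 32-bit x86 and DMD's inline assembler):

        // dst[0 .. n] += src[0 .. n] * m; returns the final carry.
        // The pointer increments and the carry ADDs can legally move
        // around the MUL and still give the same answer; which ordering
        // hides the multiplier latency is where the 30% lives.
        uint mulAdd(uint* dst, const(uint)* src, size_t n, uint m)
        {
            asm
            {
                push EBX;
                push ESI;
                push EDI;
                mov ESI, src;
                mov EDI, dst;
                mov ECX, n;
                xor EBX, EBX;       // running carry
                test ECX, ECX;
                jz done;
            next:
                mov EAX, [ESI];
                mul m;              // EDX:EAX = src[i] * m
                add EAX, EBX;       // fold in the previous carry
                adc EDX, 0;
                add [EDI], EAX;     // accumulate into dst[i]
                adc EDX, 0;
                mov EBX, EDX;       // carry for the next limb
                add ESI, 4;
                add EDI, 4;
                dec ECX;
                jnz next;
            done:
                mov EAX, EBX;       // return the final carry in EAX
                pop EDI;
                pop ESI;
                pop EBX;
            }
        }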


But that's precisely the sort of thing optimisers/schedulers are best
at. Can you point at a particular case where the scheduler would get
it wrong if left to its own ordering algorithm? The opcode tables
should have thorough information about the opcode timings and
latencies.

I don't know. I can just tell you that they don't get it right. I
suspect they don't take all of the bottlenecks into account.

For x86 I think the primary difficulty is that you cannot do it in
independent passes. E.g., you won't find a register contention
bottleneck until you've assigned registers, and the only way to get
rid of it is to change the instructions you're using, which involves
backtracking through several passes. Very messy.

The only thing I find usually trips it up is having no knowledge of
the probability of the data being in nearby cache. If it has two
loads, and one is less likely to be in cache, that one should be
scheduled earlier.
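
Trivial sketch of what I mean (names are illustrative; the point is
that the compiler has no way to know which pointer is the likely miss):

    // If *cold is likely to miss cache, issue that load as early as
    // possible so the miss overlaps the rest of the work.
    float combine(const(float)* hot, const(float)* cold)
    {
        float b = *cold;    // likely miss: schedule first
        float a = *hot;     // likely hit
        return a * b;
    }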

Yes, that's definitely true.


As a side question, x86 architectures perform wildly differently from
each other. How do you reliably say some block of hand-written x86
code is the best possible code on all available processors?
Do you just benchmark on a suite of common processors available at the
time? I can imagine the opcode timing tables, which are presumably
rather different for every CPU, could easily feed wrong data to the
codegen...

Yes. You can fairly easily determine a theoretical limit for a piece
of code, and if you've reached that, you're optimal. For example, if
the inner loop issues three multiplies per iteration and the core can
only start one multiply per cycle, then three cycles per iteration is
the floor; hit that and you're done.

It's not possible to be simultaneously optimal on Pentium4 and
something else, but my experience is that code optimized for
PPro-series Intel machines is usually near-optimal on AMD. (The
reverse is not true; it's much easier to be optimal on AMD.)


    I think that the SIMD case gives you a misleading impression,
    because on x86 they are very easy to schedule (they nearly all
    take the same number of cycles, etc.). So it's not hard for the
    compiler to do a good job of it.


True, but it's one of the most common usage scenarios, so it can't be
ignored. Some other case studies I feel close to are hardware
emulation, software rasterisation, particles, fluid dynamics, rigid
body dynamics, FFTs, and audio signal processing. In each, I rarely
need inline asm; the only time I do is when there is a hole in the
high-level language, as you said earlier. I find this typically
surfaces when needing to interact with the flags registers directly.

I agree with that. I think the need for asm in those cases could be greatly reduced. I'm just saying that there are cases where eliminating asm is not realistic.
