Re: Performance of tables slower than built in?

Ola Fosheim Grøstad via Digitalmars-d-learn Fri, 24 May 2019 06:00:39 -0700

On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote:

So it seems like a quarter-wave LUT is 27 times faster than sin…
If so then that is great and what I'd expected to achieve originally.
I guess this is using LDC though? I wasn't able to compile with LDC since after updating I'm getting linker errors that I have to go figure out.

Yes, the gist linked above is just your code with minor changes, that was 4-5 times faster. To get to 27 times faster you need to use the integer bit-manipulation scheme that I suggest above. Just beware that I haven't checked the code, so it might be off by ±1 and such.

Anyway, it is more fun for you to code up your own version than to try to figure out mine. Just follow the principles and you should get close to that performance, I think. (I'll refine the code later, but don't really have time now.)

You just have to make sure that the generated instructions fill the entire CPU pipeline.
What exactly does this mean? I realize the pipeline in CPUs is how the CPU decodes and optimizes the instructions, but when you say "You have to make sure", that presupposes there is a method or algorithm to know.

Yes, you have to look up information about the CPU in your computer. Each core has a set of "lanes" that are computed simultaneously. Some instructions can go into many lanes, but not all. Then there might be bubbles in the pipeline (the lane) that can be filled up with integer/bit manipulation instructions. It is tedious to look that stuff up, so treat it as a last resort. Just try to mix simple integer with simple double computations (avoid division).

Are you saying that I did not have enough instructions that the pipeline could take advantage of?

Yes, you most likely got bubbles. Empty space where the core has nothing to send down a lane, because it is waiting for some computation to finish so that it can figure out what to do next.


Basic optimization:

Step 1: reduce dependencies between computations

Step 2: make sure you generate a mix of simple integer/double instructions that can fill up all the computation lanes at the same time

Step 3: make sure loops only contain a few instructions, the CPU can unroll loops in hardware if they are short. (not valid here though)

Of course, a lot of that might simply be due to LDC and I wasn't able to determine this.

I think I got better performance because I filled more lanes in the pipeline.

Half sin was done above but quarter sine can be used (there are 4 quadrants but only one has to be tabulated, because all the others differ only by sign and reversion (1 - x); it's a matter of figuring out the sign).

Yes, as I mentioned, the first bit of the phase is the sign and the second bit of the phase is the reversion of the indexing.
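A minimal sketch of that bit layout, written in Python only to show the logic (it says nothing about the speed of the D version, and it is not the code from the gist). The names `TABLE_BITS`, `QUARTER`, `lut_sin` and `phase_from_radians` are made up for illustration, and sampling the table at bin centres is an assumption chosen so that the reversed index always lands on a valid entry:

```python
import math

TABLE_BITS = 10                       # assumed size: 1024 entries per quarter wave
TABLE_SIZE = 1 << TABLE_BITS
PHASE_BITS = TABLE_BITS + 2           # two extra top bits select the quadrant
PHASE_MASK = (1 << PHASE_BITS) - 1

# One quarter wave, sampled at bin centres (assumption, see lead-in).
QUARTER = [math.sin((i + 0.5) * (math.pi / 2) / TABLE_SIZE)
           for i in range(TABLE_SIZE)]


def lut_sin(phase: int) -> float:
    """Sine of an integer phase; one full cycle is 2**PHASE_BITS steps."""
    phase &= PHASE_MASK                          # wrap around the cycle
    negative = (phase >> (PHASE_BITS - 1)) & 1   # top bit: sign
    mirror = (phase >> (PHASE_BITS - 2)) & 1     # second bit: reversed indexing
    index = phase & (TABLE_SIZE - 1)
    if mirror:
        index = TABLE_SIZE - 1 - index           # walk the table backwards
    value = QUARTER[index]
    return -value if negative else value


def phase_from_radians(x: float) -> int:
    """Quantize an angle in radians to the integer phase used above."""
    return math.floor(x / (2 * math.pi) * (1 << PHASE_BITS)) & PHASE_MASK
```

With these assumed sizes the phase step is 2π/4096, so `lut_sin(phase_from_radians(x))` agrees with `sin(x)` to within about one phase step. In a compiled version the `if` would typically become an XOR with the index mask and the sign flip a bit operation, which is where the "extra logic" cost comes from.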

Of course it requires extra computation so it would be interesting to see the difference in performance for the extra logic.


It adds perhaps 2-5 cycles or so, I'm guessing.

exp(x) can be written as exp(floor(x) + {x}) = exp(floor(x))*exp({x})

[...]

With linear interpolation one can get very accurate (for all practical purposes) LUT methods that, if your code is right, are at least an order of magnitude faster. The memory requirements will be quite small with linear interpolation.
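As a rough illustration of the linear-interpolation idea, here is a small Python sketch for sine over one period. It shows the lookup logic only, not the speed. The table size `N` and the names `TABLE` and `interp_sin` are assumptions for the example, not something taken from the thread's code:

```python
import math

N = 256                               # assumed number of intervals per period
STEP = 2 * math.pi / N
# One extra entry at the end so that TABLE[i + 1] is always valid.
TABLE = [math.sin(i * STEP) for i in range(N + 1)]


def interp_sin(x: float) -> float:
    """Approximate sin(x) by interpolating linearly between table entries."""
    t = (x % (2 * math.pi)) / STEP    # position in table units, 0 <= t <= N
    i = min(int(t), N - 1)            # clamp in case rounding gives t == N
    frac = t - i                      # distance past entry i, in [0, 1]
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])
```

For linear interpolation of sine the worst-case error is bounded by STEP²/8, which is about 7.5e-5 for these 257 entries. Doubling the table size cuts that bound by a factor of four, which is why the memory requirement stays small.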

I think you need to do something with the x before you look up, so that you have some kind of fast nonlinear mapping to the indexes.

But for exp() you might prefer an approximation instead, perhaps a polynomial such as a Taylor series.


Searching the web should give some ideas.
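One way the two ideas could be combined, sketched in Python purely to show the arithmetic: use the split exp(x) = exp(floor(x))*exp({x}), take exp(floor(x)) from a small table, and approximate exp({x}) on [0, 1) with a truncated Taylor polynomial. The range limits, the degree 6, and the names `EXP_INT`, `frac_exp` and `fast_exp` are all assumptions made for this example:

```python
import math

INT_MIN, INT_MAX = -20, 20            # assumed supported range of floor(x)
# exp(k) for every integer k in the supported range.
EXP_INT = [math.exp(k) for k in range(INT_MIN, INT_MAX + 1)]


def frac_exp(f: float) -> float:
    """Degree-6 Taylor polynomial of exp(f) in Horner form, for 0 <= f < 1."""
    return 1 + f * (1 + f * (1 / 2 + f * (1 / 6 + f * (1 / 24
               + f * (1 / 120 + f / 720)))))


def fast_exp(x: float) -> float:
    """exp(x) as a table lookup for floor(x) times a polynomial for {x}."""
    k = math.floor(x)
    if not INT_MIN <= k <= INT_MAX:
        raise ValueError("x outside the tabulated range")
    return EXP_INT[k - INT_MIN] * frac_exp(x - k)
```

With degree 6 the relative error stays below 1e-3 over the whole range, with the worst case just below an integer where {x} approaches 1. A higher degree or a minimax polynomial would tighten that, at the cost of a few more multiply-adds.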

It seems you already have the half-sin done.

I did the quarter sin though, not the half-sin (but that is almost the same, just drop the reversion of the indexing).

(Let's talk about this later, since we both have other things on our plate. Fun topic! :-)
