On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote:
So it seems like a quarter-wave LUT is 27 times faster than
sin…
If so then that is great and what I'd expected to achieve
originally.
I guess this is using LDC though? I wasn't able to compile with
LDC since after updating I'm getting linker errors that I have
to go figure out.
Yes, the gist linked above is just your code with minor changes,
that was 4-5 times faster. To get to 27 times faster you need to
use the integer bit-manipulation scheme that I suggest above.
Just beware that I haven't checked the code, so it might be off
by ±1 and such.
Anyway, it is more fun for you to code up your own version than
to try to figure out mine. Just follow the principles and you
should get close to that performance, I think. (I'll refine the
code later, but don't really have time now)
You just have to make sure that the generated instructions
fills the entire CPU pipeline.
What exactly does this mean? I realize the pipeline in cpu's is
how the cpu decodes and optimizes the instructions but when you
say "You have to make sure" that pre-supposes there is a method
or algorithm to know.
Yes, you have to look up information about the CPU in your
computer. Each core has a set of "lanes" that are computed
simultanously. Some instructions can go into many lanes, but not
all. Then there might be bubbles in the pipeline (the lane) that
can be filled up with integer/bit manipulation instructions. It
is tedious to look that stuff up. So, last resort. Just try to
mix simple integer with simple double computations (avoid
division).
Are you saying that I did not have enough instructions that the
pipeline could take advantage of?
Yes, you most likely got bubbles. Empty space where the core has
nothing to send down a lane, because it is waiting for some
computation to finish so that it can figure out what to do next.
Basic optimization:
Step 1: reduce dependencies between computations
Step 2: make sure you generate a mix of simple integer/double
instructions that can fill up all the computation lanes at the
same time
Step 3: make sure loops only contain a few instructions, the CPU
can unroll loops in hardware if they are short. (not valid here
though)
Of course, a lot of that might simply be due to LDC and I
wasn't able to determine this.
I think I got better performance because I filled more lanes in
the pipeline.
Half sin was done above but quarter sine can be used(there are
4 quadrants but only one has to be tabularized because all the
others differ by sign and reversion(1 - x), it's a matter of
figuring out the sign).
Yes, as I mentioned, the first bit of the phase is the sign and
the second bit of the phase is the reversion of the indexing.
Of course it requires extra computation so it would be
interesting to see the difference in performance for the extra
logic.
It adds perhaps 2-5 cycles or so, my guessing.
exp(x) can be written as exp(floor(x) + {x}) =
exp(floor(x))*exp({x})
[...]
With linear interpolation one can get very accurate(for all
practical purposes) LUT table methods that, if your code is
right, is at least an order of magnitude faster. The memory
requirements will be quite small with linear interpolation
I think you need to do something with the x before you look up,
so that you have some kind of fast nonlinear mapping to the
indexes.
But for exp() you might prefer an approximation instead, perhaps
polynomial taylor series perhaps.
Searching the web should give some ideas.
It seems you already have the half-sin done.
I did the quarter sin though, not the half-sin (but that is
almost the same, just drop the reversion of the indexing).
(Let's talk about this later, since we both have other things on
our plate. Fun topic! :-)