On 12/08/2014 01:35 PM, Stefan Sullivan wrote:
> Hey music DSP folks,
> I'm wondering if anybody knows much about using these open source
> compilers to compile to various DSP architectures (e.g. SHARC, ARM,
> TI, etc).
I have some experience with ARM Cortex-M4, using fixed point. Everything
in this message is specific only to Cortex-M4, and might apply to the
upcoming Cortex-M7, but it's unrelated to other processors.
ARM's marketing material makes a lot of carefully worded claims about
"DSP extensions" which provide "up to a 2X speed improvement". That is
true. But many people mistakenly leap to the conclusion that there's a
DSP co-processor or something resembling a "DSP architecture" in the
chip. In traditional DSP, you'd expect simultaneous fetching of data
and coefficient, multiply-accumulate and loop counting. Cortex-M4 has
nothing like that. It's still very much a traditional microprocessor
where those operations are separate instructions.
The DSP extensions are designed for 16 bit fixed point data, which you
pack into the native 32 bit memory and registers. Much of the 2X
speedup occurs in the loading and storing of data. With ARM's
traditional instructions, 16 bit data is sign extended into 32 bit
registers on load, and those upper 16 bits are discarded when storing it
back to memory. Cortex-M4 also has an optimization in hardware where it
detects successive instructions performing similar load or store and
combines them into a single burst access on the bus, so the 2nd, 3rd,
4th access take only a single cycle. In traditional DSP, the
architecture loads data in the same cycle as the math is performed. At
best, ARM's extension gets the loading overhead close to 0.5 cycles per
16 bit word.
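To make the packing concrete, here's a minimal portable C sketch of two
16 bit samples sharing one 32 bit word. The helper names (pack16, lo16,
hi16) are made up for this example and don't come from any ARM header.
It only models the data layout, not the load/store timing.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only. */

/* Pack two signed 16 bit samples into one 32 bit word:
   'lo' goes in bits 0-15, 'hi' goes in bits 16-31. */
static inline uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Extract the lower sample, sign extended.  The uint16_t to int16_t
   conversion is implementation-defined in C, but wraps as expected
   on gcc and clang. */
static inline int16_t lo16(uint32_t w)
{
    return (int16_t)(uint16_t)(w & 0xFFFFu);
}

/* Extract the upper sample, sign extended. */
static inline int16_t hi16(uint32_t w)
{
    return (int16_t)(uint16_t)(w >> 16);
}
```

With this layout, one 32 bit load from a buffer of packed words brings
in two samples at once, which is where the "0.5 cycles per 16 bit word"
figure comes from in the best case.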
Actually using the DSP extensions requires keen awareness and planning
of the ARM register usage. As far as I know, the only way to cause gcc
to use them is inline assembly, which is usually wrapped with inline
functions or preprocessor macros. ARM's marketing material makes a lot
of claims about how only C programming is needed. While that's
technically true, given an already-written header file with the inline
assembly (some commercial compilers have "intrinsics" which are
basically the same thing), the honest truth is assembly code is
involved. Really leveraging these instructions requires careful
planning of how many registers you'll use to bring in packed pairs of
samples, how many will hold your intermediate calculations, loop
counters, pointers, and other overhead. If you exceed the 12 or 13
available ARM 32 bit registers, the compiler needs to spill variables
onto the stack, which ruins any speed benefit you might hope to achieve
by going to so much effort to use the DSP extensions.
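As a rough sketch of what such a wrapper can look like, here is one
instruction (SMULBB, a 16x16 multiply of the bottom halves) wrapped in
an inline function using gcc-style inline assembly. The function name
mul_16bx16b is hypothetical. I'm assuming the ACLE macro
__ARM_FEATURE_DSP is how you'd detect the DSP extensions at compile
time; the plain C fallback is only there so the same code builds on a
desktop for testing.

```c
#include <stdint.h>

/* Hypothetical wrapper, named for this example only.
   Multiplies the bottom 16 bits of a by the bottom 16 bits of b
   (both treated as signed) and returns the 32 bit product. */
static inline int32_t mul_16bx16b(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && defined(__ARM_FEATURE_DSP)
    /* Assumed path for Cortex-M4: emit the SMULBB instruction. */
    int32_t out;
    __asm__ ("smulbb %0, %1, %2" : "=r" (out) : "r" (a), "r" (b));
    return out;
#else
    /* Portable fallback with the same arithmetic result. */
    return (int32_t)(int16_t)(uint16_t)a * (int32_t)(int16_t)(uint16_t)b;
#endif
}
```

The calling code looks like ordinary C, but every call still costs you
the registers for a, b and the result, which is why the planning
described above matters.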
Another feature of DSP fixed point architectures is automatic saturation
(clipping) during addition. This too is usually done with a separate
instruction on ARM. They do provide a couple of add instructions with
automatic saturation, but pervasive support for saturation during all
calculations is not present.
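For anyone unsure what saturation buys you, here's a small portable C
sketch contrasting a clipped add with a plain wrapping add. The function
names are made up for this example. On Cortex-M4 the clipping step would
typically be its own instruction (SSAT, for example) rather than
something that happens for free.

```c
#include <stdint.h>

/* Clip a 32 bit value into the signed 16 bit range. */
static inline int16_t saturate16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Add two samples in 32 bit precision, then clip as a separate step. */
static inline int16_t add_sat16(int16_t a, int16_t b)
{
    return saturate16((int32_t)a + (int32_t)b);
}

/* For contrast: keep only the low 16 bits of the sum, so overflow
   wraps around.  The final conversion to int16_t is
   implementation-defined in C but wraps on gcc and clang. */
static inline int16_t add_wrap16(int16_t a, int16_t b)
{
    return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}
```

Adding 30000 and 10000 clips to 32767 with add_sat16, but wraps to
-25536 with add_wrap16, which in audio is a loud click instead of mild
clipping.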
Looping overhead is also still an issue. Typically, you would compose
your code to process 4, 8, or 16 samples in each loop iteration. That
lets you use the pipeline burst to bring the packed samples in to 2, 3
or 4 registers. Then you'd unroll your code, placing 4, 8 or 16 copies
of whatever math you're doing, and store the results to the output
buffer, taking advantage of the pipeline burst for writing. Then you'd
suffer looping overhead, which isn't so bad if you're processing 8 or 16
samples per iteration.
I've written a lot about code structure, planning of data packing, and
register allocation, so far without any specifics of the
actual operations, for a good reason. Really using the DSP extension is
like this. You spend almost all the time (or at least I do) planning
this stuff, so you can actually take advantage of the narrow but useful
features those instructions provide.
The actual instructions are documented in the ARM v7-M reference manual
(ARM document DDI0403D), starting on page 133, section A4.4.3.
Probably the most interesting instructions are SMLALD & SMLALDX. Each
performs two 16x16 signed multiplies and adds both products to a signed
64 bit accumulator. The 4 numbers to multiply have to be packed into 2
normal 32 bit ARM registers. SMLALD multiplies the lower halves together
and the upper halves together, and SMLALDX multiplies the lower half in
one register with the upper half in the other, and vice versa. No other
combinations are possible, so you must arrange your data appropriately
if you want to get 2 multiply-accumulates in a single cycle. But there
is a version that subtracts one of the products. There are also versions
that accumulate to only 32 bits, which give you one extra precious 32
bit register, in cases where you're sure overflow isn't an issue
(remember, these don't automatically saturate if your math overflows the
accumulator).
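Here's a portable C model of the arithmetic those instructions perform,
as I understand it from the reference manual. The function names
(model_smlald and friends) are made up; this only shows which halves
get multiplied, it says nothing about cycle counts. The subtracting
version modeled here is SMLSLD, taken as lower product minus upper
product.

```c
#include <stdint.h>

/* Signed lower and upper 16 bit halves of a packed 32 bit word.
   Hypothetical helper names. */
static inline int16_t bot16(uint32_t w) { return (int16_t)(uint16_t)w; }
static inline int16_t top16(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }

/* SMLALD model: acc + lower(a)*lower(b) + upper(a)*upper(b) */
static inline int64_t model_smlald(int64_t acc, uint32_t a, uint32_t b)
{
    return acc + (int32_t)bot16(a) * bot16(b)
               + (int32_t)top16(a) * top16(b);
}

/* SMLALDX model: acc + lower(a)*upper(b) + upper(a)*lower(b) */
static inline int64_t model_smlaldx(int64_t acc, uint32_t a, uint32_t b)
{
    return acc + (int32_t)bot16(a) * top16(b)
               + (int32_t)top16(a) * bot16(b);
}

/* SMLSLD model: acc + lower(a)*lower(b) - upper(a)*upper(b) */
static inline int64_t model_smlsld(int64_t acc, uint32_t a, uint32_t b)
{
    return acc + (int32_t)bot16(a) * bot16(b)
               - (int32_t)top16(a) * top16(b);
}
```

For example, with a holding (lower 2, upper 3) and b holding (lower 4,
upper 5), SMLALD adds 2*4 + 3*5 = 23 to the accumulator and SMLALDX
adds 2*5 + 3*4 = 22.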
The other really useful instructions do only a single multiply, but with
16x32 bits, resulting in a 48 bit product, or 32x32 bits, resulting in a
64 bit product, where the low bits are discarded and you get the top 32
bit result into a 32 bit register. Those come in flavors that can also
add to or subtract from a register. Even though these only do a single
multiply per cycle, they're really useful, and they come in pairs which
can take the 16 bit operand from either the bottom or top half of a 32
bit register.
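A portable C model of these "keep the top 32 bits" multiplies might
look like the sketch below. I'm assuming the instructions in question
are SMULWB/SMULWT (32x16) and SMMUL (32x32); the model_ names are made
up. The code also assumes right shift of a negative value is an
arithmetic shift, which C leaves implementation-defined but gcc and
clang provide.

```c
#include <stdint.h>

/* SMULWB model: 32 bit a times the signed bottom half of b.
   The 48 bit product loses its low 16 bits, leaving 32 bits. */
static inline int32_t model_smulwb(int32_t a, uint32_t b)
{
    int64_t product = (int64_t)a * (int16_t)(uint16_t)b;
    return (int32_t)(product >> 16);
}

/* SMULWT model: same, but using the signed top half of b. */
static inline int32_t model_smulwt(int32_t a, uint32_t b)
{
    int64_t product = (int64_t)a * (int16_t)(uint16_t)(b >> 16);
    return (int32_t)(product >> 16);
}

/* SMMUL model: 32x32 multiply, keeping the top 32 bits of the
   64 bit product. */
static inline int32_t model_smmul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}
```

The bottom/top pair is what lets you use both 16 bit values packed in
one register without first unpacking them.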
The general idea is to bring 4, 8 or 16 samples packed into 2, 3, or 4
ARM registers, then use these instructions to do math with those 16 bit
numbers, producing 32 or 64 bit results, which you then shift/saturate
down to the 16 bit outputs and pack into 32 bit registers and write out,
before suffering only 1/4, 1/8 or 1/16 the looping overhead of doing
things the simpler C-only way.
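Putting those steps together, here's a minimal portable C sketch of
that loop shape, using a simple gain as the "math". All names are
hypothetical, the gain format (Q12, where 4096 means 1.0) is just an
assumption for the example, and plain C multiplies stand in for the
DSP instructions. It handles 4 samples (2 packed words) per iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for this example only. */
static inline int16_t sample_lo(uint32_t w) { return (int16_t)(uint16_t)w; }
static inline int16_t sample_hi(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }

static inline uint32_t pack_pair(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

static inline int16_t clip16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* 16x16 multiply to a 32 bit result, shift back down, saturate.
   Assumes arithmetic right shift of negative values (true on gcc). */
static inline int16_t gain_sample(int16_t s, int16_t gain_q12)
{
    int32_t product = (int32_t)s * gain_q12;
    return clip16(product >> 12);
}

/* Apply a gain to a buffer of packed samples, 2 words per iteration.
   nwords is expected to be even; an odd trailing word is ignored. */
void apply_gain_block(uint32_t *dst, const uint32_t *src,
                      size_t nwords, int16_t gain_q12)
{
    for (size_t i = 0; i + 1 < nwords; i += 2) {
        /* Load 4 packed samples with 2 word reads. */
        uint32_t w0 = src[i];
        uint32_t w1 = src[i + 1];

        /* Unrolled math: 4 copies of the same operation. */
        int16_t s0 = gain_sample(sample_lo(w0), gain_q12);
        int16_t s1 = gain_sample(sample_hi(w0), gain_q12);
        int16_t s2 = gain_sample(sample_lo(w1), gain_q12);
        int16_t s3 = gain_sample(sample_hi(w1), gain_q12);

        /* Repack and store 4 results with 2 word writes. */
        dst[i]     = pack_pair(s0, s1);
        dst[i + 1] = pack_pair(s2, s3);
    }
}
```

With a gain of 8192 (2.0), input samples 50, 100, 30000 and -25536
come out as 100, 200, 32767 and -32768, the last two clipped. Widening
the loop to 8 or 16 samples per iteration is more of the same, limited
by how many registers you have left.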
Despite the underwhelming nature of these DSP extensions, if used well,
they really do enable processing 16 bit audio pretty well on low cost,
low power Cortex-M4. The gains in load, store, loop unrolling, and
needing fewer instructions do make quite a difference over traditional C
code, even if you never manage to use the 2-mult-per-cycle SMLALD &
SMLALDX instructions.
I realize most people want the simplicity and convenience of using C
without worrying about assembly. Despite what ARM's carefully worded
marketing literature might lead you to believe, the honest reality is
these special instructions really require quite a lot of detail-oriented
programming. It's not quite as tedious as assembly, since the compiler
still takes care of register allocation and code generation, but to
write such code, you really must carefully plan register usage and
mapping to these special instructions. The compiler doesn't do that for
you.
However, there are libraries which have all this stuff done for you. If
your needs are met by the already-written library functions, then you
can reap all the benefit without the pain of such low-level coding. ARM
publishes a CMSIS-DSP library, with a large collection of vector math
operations, including FFT and FIR filters. DSP Concepts publishes an
expensive but very nice audio toolkit (which I have not personally
used). And I recently have been working on an open source Cortex-M4
audio library, which you can find here:
http://www.pjrc.com/teensy/td_libs_Audio.html
I hope this long-winded explanation helps. Despite my best efforts to
keep things non-commercial on mail lists, if you're doing an audio
project where 16 bit fixed point is appropriate, I hope you'll consider
using the free code and inexpensive Teensy products I make. ;-)
Again, all this applies only to Cortex-M4 and ARM's fixed point DSP
extensions.