  r2 = r1 + r3, r4 = dm(i0,m1);  /* addition and memory access */
Yep. In my answer to Florian I forgot that, unlike ARM, the Blackfin can do a calculation and a memory access in a single instruction cycle. That explains the much better performance even on standard (non-DSP-like) tasks.
  r3 = r2 * r4, r1 = r2 + r4;    /* multiplication and addition */
I did not know that it can do two independent 32-bit calculations, or that it can do 32-bit multiplications. Anyway, even if only two 32-bit additions can be done in one instruction cycle, this is a big opportunity for optimization.
A totally different topic is the inherent parallel processing of a DSP.
Usually they can utilize several processing units (+, *) and memories
within a single cycle (e.g. see above). Instruction ordering and
interleaving to utilize parallelism is tedious to do by hand and I think
also challenging for a compiler.
Maybe a first version could skip these optimization opportunities and just do a single operation per instruction cycle.

It should be possible to create a working compiler that way.
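
To make the difference concrete, here is a rough sketch in Python (not FPC code, and not a model of any real DSP) that compares one operation per cycle with a naive in-order pairing of independent operations. Everything in it is an assumption for illustration: the names `Op`, `conflicts` and `pack` are hypothetical, any two independent operations are assumed to fit into one cycle regardless of which processing units they need, and the address register update of `dm(i0,m1)` is ignored.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    dest: str               # register written by the operation
    srcs: Tuple[str, ...]   # registers read by the operation
    text: str               # assembly-like text, for display only


def conflicts(a: Op, b: Op) -> bool:
    """Return True if b (later in program order) must not share a cycle with a."""
    raw = a.dest in b.srcs   # b reads what a writes
    waw = a.dest == b.dest   # both write the same register
    war = b.dest in a.srcs   # b overwrites an input of a (rejected to stay conservative)
    return raw or waw or war


def pack(ops: List[Op], width: int) -> List[List[Op]]:
    """Greedy in-order packing with at most `width` operations per cycle.

    width=1 is the simple "first version"; no reordering is attempted.
    """
    bundles: List[List[Op]] = []
    for op in ops:
        current = bundles[-1] if bundles else None
        if (current is not None and len(current) < width
                and not any(conflicts(prev, op) for prev in current)):
            current.append(op)
        else:
            bundles.append([op])
    return bundles


# The four operations from the quoted example, in program order.
PROGRAM = [
    Op("r2", ("r1", "r3"), "r2 = r1 + r3"),
    Op("r4", ("i0", "m1"), "r4 = dm(i0,m1)"),
    Op("r3", ("r2", "r4"), "r3 = r2 * r4"),
    Op("r1", ("r2", "r4"), "r1 = r2 + r4"),
]

if __name__ == "__main__":
    for width in (1, 2):
        bundles = pack(PROGRAM, width)
        print(f"width {width}: {len(bundles)} cycles")
        for bundle in bundles:
            print("  " + ", ".join(op.text for op in bundle) + ";")
```

With `width=1` the four operations take four cycles; with `width=2` the packer pairs them into the same two instructions as in the quoted example. A real code generator would also have to respect which units can actually run in parallel, and would probably want to reorder operations, which is where it gets hard.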

-Michael
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
