Acting on the advice that replacing vectorized matrix-matrix multiplications with loops would help performance, I extracted a piece of code from my finite element solver (https://gist.github.com/anonymous/4ec426096c02faa4354d) and ran some tests, with the following results:

Vectorized code:
elapsed time: 0.326802682 seconds (134490340 bytes allocated, 17.06% gc time)

Loops code:
elapsed time: 4.681451441 seconds (997454276 bytes allocated, 9.05% gc time)

SLOWER and using MORE memory?! I must be doing something terribly wrong.

Petr