One more note: I suspected that the compiler might not be inferring the types of the matrices correctly, so I hardwired the following in the actual FE code:
Jac = 1.0; gradN = gradNparams[j]/(J); # get rid of Rm for the moment

This used about 10% less memory, and the runtime was about the same. So, no real effect. The loops are still slower than the vectorized code by a factor of two.

Petr
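
For anyone who wants to test the type-inference conjecture directly rather than by hardwiring values, here is a minimal sketch. It assumes that `gradNparams` is a vector of `Float64` matrices and that `J` is a scalar; the function `scaled_gradient` is a hypothetical stand-in for the loop body, not the actual FE code.

```julia
using Test  # standard library; provides @inferred

# Hypothetical stand-in for the loop body: divide the j-th matrix of
# basis-function gradients by a scalar Jacobian. The argument types are
# assumptions, not taken from the actual FE code.
function scaled_gradient(gradNparams::Vector{Matrix{Float64}}, j::Int, J::Float64)
    return gradNparams[j] / J
end

gradNparams = [rand(4, 2) for _ in 1:3]

# @inferred throws an error if the return type inferred by the compiler
# differs from the type of the value actually returned.
gradN = @inferred scaled_gradient(gradNparams, 1, 2.0)

# The division allocates a new matrix on every call. Measure the bytes
# allocated by one call (the call above already compiled the function).
bytes = @allocated scaled_gradient(gradNparams, 1, 2.0)
println("bytes allocated per call: ", bytes)
```

If `@inferred` passes, inference is probably not the culprit for that expression, which would be consistent with the hardwiring having no effect. The allocation count points at the other usual suspect in loop code: a fresh temporary matrix created on every iteration.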