for completeness: with the inner loops now going through the first index as suggested by Jeff, there was another increase in speed. So now I stand at *16.8 s* on average with julia.
The same thing in python/numpy takes roughly *6800 s* to run (however not vectorized in numpy, using for loops as in the examples above)