With large numerical codes, it is common for things to run faster in memory
on just a few cores when the cost of the required communication outweighs
the parallel speedup.
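
To make that tradeoff concrete, here is a toy cost model in Python. It is only a sketch: the function names, the linear communication term, and the constants are all assumptions chosen for illustration, not measurements of any real code. Compute time shrinks as work is divided across cores, while communication cost grows with each added core.

```python
def run_time(cores, work=1.0, comm_per_core=0.02):
    """Toy model (assumed, not measured): compute time scales as
    work / cores, while communication grows linearly with core count."""
    return work / cores + comm_per_core * (cores - 1)


def best_core_count(max_cores, work=1.0, comm_per_core=0.02):
    """Return the core count in 1..max_cores that minimizes run_time."""
    return min(
        range(1, max_cores + 1),
        key=lambda p: run_time(p, work, comm_per_core),
    )


if __name__ == "__main__":
    for p in (1, 2, 4, 8, 16, 32, 64):
        print(f"{p:3d} cores: {run_time(p):.3f}")
    print("best core count:", best_core_count(64))
```

With these made-up constants the minimum lands at a handful of cores, and at 64 cores the modeled run is slower than on a single core. Real codes have different cost curves, but the shape of the argument is the same.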
The issue is that memory bandwidth is lower than arithmetic throughput by a
wide margin. If you just have to move stuff into the CPU and m
This uses the Mahout BLAS optimizing solver, which I just use and do not know
well. Mahout virtualizes some things having to do with partitioning and I’ve
never quite understood how they work. There is a .par() on one of the matrix
classes that has a similar function to partition but in all case