Maybe I'm missing something obvious, but couldn't you just write your own 'cross' function that uses a couple of nested for-loops to do the arithmetic, with no intermediate allocations at all?
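
Something like the rough sketch below, assuming cross(X, w, Y) means X' * diag(w) * Y (the thread doesn't spell that out, so treat it as my guess). It is written in plain Python purely for illustration; the function name, argument layout (lists of rows), and loop order are all hypothetical choices, not anything from an existing library. The only array it allocates is the small p-by-q result, and X is never modified.

```python
def cross(X, w, Y):
    """Hypothetical helper: C[i][j] = sum over k of X[k][i] * w[k] * Y[k][j].

    X is n-by-p, Y is n-by-q (both as lists of rows), w has length n.
    Assumes 'cross' means X' * diag(w) * Y, which is a guess.
    """
    n = len(X)
    if len(w) != n or len(Y) != n:
        raise ValueError("X, w and Y must have the same number of rows")
    p = len(X[0]) if n else 0
    q = len(Y[0]) if n else 0

    # The only allocation: the p-by-q result, small when n is much larger than p and q.
    C = [[0.0] * q for _ in range(p)]

    # Single pass over the rows; X, w and Y are only read, never written.
    for k in range(n):
        wk = w[k]
        xrow = X[k]
        yrow = Y[k]
        for i in range(p):
            s = wk * xrow[i]  # scalar temporary, not an n-sized array
            ci = C[i]
            for j in range(q):
                ci[j] += s * yrow[j]
    return C


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [1.0, 0.5, 2.0]
Y = [[1.0], [2.0], [3.0]]
result = cross(X, w, Y)
print(result)  # [[34.0], [42.0]]
```

In a compiled language the same three loops would run at reasonable speed, though probably not as fast as a tuned BLAS call on the pre-scaled matrix; the trade is speed for not needing the extra copy.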

On Tuesday, July 7, 2015 at 6:24:34 PM UTC-4, Matthieu wrote:
> Thanks, this is what I currently do :)
>
> However, I'd like to find a solution that is both memory efficient (X can
> be very large) and which does not modify X in place.
>
> Basically, I'm wondering whether there was a BLAS subroutine that would
> allow to compute cross(X, w, Y) in one pass without creating an
> intermediate matrix as large as X or Y.