On Sun, Sep 27, 2009 at 9:37 PM, Ted Dunning <[email protected]> wrote:
> > So ... if R is defined by a PRNG with well defined seeds for each row, then > you don't have to store it. It is particularly nice if you can compute the > n-th random number from a particular seed at high speed. If you can do > that, then multiplication by a sparse matrix is very clean. You can > probably do that with most generators by building a moral equivalent of a > skip list ... just save the seed every few hundred elements. Better to > have > a generator that can be iterated more efficiently, but usable. Note the > comments in the articles about how very poor generators work just fine. > Considering the PRNG is just needed to get vectors which aren't linearly dependent, it seems like even using a completely random-access random stream (like M_i,j = new Random(seed ^ (i * numRows + j)).nextInt() ) should work, requiring no memory at all. Now in a MR setting, A will probably be split into row-wise patches. y = AR > is the row-wise concatenation of these. With the PRNG trick, each mapper > can have an implicit copy of R. Otherwise, it will have to be distributed > as a side file to all mappers and a merge will be necessary. But if we are > after y' y as well, we can compute this by taking outer products of each > row > of y and summing the resulting k x k matrices. This works wonderfully > because combiners and such can combine partial map outputs. You probably > still have to store y itself, but with a few fancy keys and a multiple > output format, you should be good to go. > When decomposing y'y, I tend to use a similar trick to this, which is probably equivalent: to multiply matrix by vector (y'y)*u, just do the vector sum of y_i * (y_i.dot(u)) over the rows y_i of y (in which case you store y, but never bother to store y'y - and just like the outer-product sum, the vector sum can be done partially in the combiners as well). Of course, y'y is much smaller than y, so avoiding saving y'y is not really that much of a win - but or this algorithm, it seems you do probably need both y and y'y, in which case it might be nice to not have to ever really deal with y'y explicitly. -jake
