Actually, maybe what you were thinking (at least, what *I* am thinking) is that you can indeed do it in one pass through the *original* data (i.e., you can get away with never keeping a handle on the original data itself): on that one pass, you spit out MultipleOutputs - one SequenceFile of the randomly projected data, which doesn't hit a reducer at all, and a second output consisting of the outer products of those projected vectors with themselves, which goes to a summing reducer.
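
Something like this mapper is what I mean - a rough sketch against the old-style mapred API plus Mahout's math classes, where the conf keys, the "projected" named output, and the class name are all just made up for illustration:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.MatrixWritable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class ProjectAndOuterProductMapper extends MapReduceBase
    implements Mapper<IntWritable, VectorWritable, NullWritable, MatrixWritable> {

  private MultipleOutputs mos;
  private Matrix omega;  // k x n Gaussian matrix, same seed on every mapper
  private int k;

  @Override
  public void configure(JobConf conf) {
    mos = new MultipleOutputs(conf);
    // "projection.rank", "input.dimension", "projection.seed" are
    // hypothetical conf keys; the driver would set them and also call
    // MultipleOutputs.addNamedOutput(conf, "projected",
    //     SequenceFileOutputFormat.class, IntWritable.class, VectorWritable.class)
    k = conf.getInt("projection.rank", 100);
    int n = conf.getInt("input.dimension", -1);
    Random rng = new Random(conf.getLong("projection.seed", 42L));
    omega = new DenseMatrix(k, n);
    for (int i = 0; i < k; i++) {
      for (int j = 0; j < n; j++) {
        omega.set(i, j, rng.nextGaussian());
      }
    }
  }

  @Override
  public void map(IntWritable row, VectorWritable a,
                  OutputCollector<NullWritable, MatrixWritable> out,
                  Reporter reporter) throws IOException {
    // y_i = Omega * a_i : the randomly projected row, written straight to
    // a SequenceFile via the named output - this side never sees a reducer.
    Vector y = omega.times(a.get());
    mos.getCollector("projected", reporter).collect(row, new VectorWritable(y));

    // y_i y_i^T : a k x k rank-one outer product, all under one key so a
    // single summing reducer accumulates the full Gram matrix Y^T Y.
    out.collect(NullWritable.get(), new MatrixWritable(y.cross(y)));
  }

  @Override
  public void close() throws IOException {
    mos.close();
  }
}

The reducer side is then just a trivial k x k matrix sum (and since everything is keyed identically, a combiner can do most of that work mapper-side).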
In this sense, while you do need a second pass over something of the original data's *size* (in terms of number of rows) if you want to treat it as data to be played with (instead of just "training" data for use on a smaller subset, or even a totally different set), you never need to pass over the original data *set* itself again.

-jake

On Mon, Mar 22, 2010 at 6:35 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> You are probably right. I had a wild hare tromp through my thoughts the
> other day saying that one pass should be possible, but I can't reconstruct
> the details just now.
>
> On Mon, Mar 22, 2010 at 6:00 PM, Jake Mannix <jake.man...@gmail.com>
> wrote:
>
> > I guess if you mean just do a random projection on the original data, you
> > can certainly do that in one pass, but that's random projection, not a
> > stochastic decomposition.
> >