On Wed, Apr 6, 2011 at 11:26 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> so, assuming 500 oversampled svalues is equivalent to perhaps 300 > 'good' values.... depending on decay... so 300 singular values would > require 300 passes over the whole input? or only sub-part of it? > Given it takes about 20 s just to set up a MR run and 10 sec to > confirm it's completion, that's just what... about 100-150 minutes > just in initialization time? > In general, yes, DistributedLanczosSolver is dominated by startup costs for nearly all data sets I've used. > Also, the size of the problem must also affect sorting i/o time > (unless all jobs are map-only, but i don't think they can be). That's > And they're not map-only, there is a shuffle on every pass, but the combiners are pretty will utilized, so the shuffle is pretty small. > kind of at least proportional to the size of the input. so I guess > problem size does matter, not just the # of available slots for the > mappers. > > > On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix <jake.man...@gmail.com> > wrote: > > Hmmm... that's a really tiny data set. Lanczos-based SVD, for k singular > > values, requires k passes over the data, and each row which has d > non-zero > > entries will do d^2 computations in each pass. So if there are n rows in > > the > > data set, it's k*n*d^2 if all rows are the same size. > > I guess "how long" depends on how big the cluster is! > > > > On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov <dlie...@gmail.com> > wrote: > >> > >> Jake, since we are on the topic, what's the running times of Lanczos > >> on a ~1G worth sequence file input might be? > >> > >> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix <jake.man...@gmail.com> > >> wrote: > >> > > >> > > >> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov <dlie...@gmail.com > > > >> > wrote: > >> >> > >> >> you can certainly try to write it out into a DRM (distributed row > >> >> matrix) and run stochastic SVD on hadoop (off the trunk now). see > >> >> MAHOUT-593. This is suitable if you have a good decay of singular > >> >> values (but if you don't it probably just means you have so much > noise > >> >> that it masks the problem you are trying to solve in your data). > >> > > >> > You don't need to run it as stochastic, either. The regular > >> > LanczosSolver > >> > will work on this data, if it lives as a DRM. > >> > > >> > -jake > > > > >