So, assuming 500 oversampled singular values are equivalent to perhaps 300 'good' values (depending on decay), would 300 singular values require 300 passes over the whole input, or only over a sub-part of it? Given that it takes about 20 s just to set up an MR run and 10 s to confirm its completion, that's what, about 100-150 minutes in initialization time alone?
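
To make the arithmetic explicit, here is a rough back-of-the-envelope sketch. It assumes one MapReduce job per Lanczos pass (which is exactly what I'm asking about, not something I've confirmed) and uses the ~20 s setup and ~10 s completion figures above. The function names and the n, d values in the usage line are made up for illustration; the k*n*d^2 count is Jake's estimate from the quoted message below.

```python
def job_overhead_minutes(passes, setup_s=20.0, confirm_s=10.0):
    """Fixed per-job overhead in minutes, assuming one MapReduce job per pass.

    setup_s and confirm_s are the rough figures from this message,
    not measured constants.
    """
    return passes * (setup_s + confirm_s) / 60.0


def lanczos_ops(k, n, d):
    """Operation count k * n * d^2 for k singular values over n rows,
    each with d non-zero entries (all rows assumed the same size)."""
    return k * n * d ** 2


if __name__ == "__main__":
    # 200 to 300 passes brackets the "100-150 minutes" guess above.
    for passes in (200, 300):
        print(passes, "passes:", job_overhead_minutes(passes), "min of overhead")
    # Hypothetical problem size, for illustration only.
    print("ops:", lanczos_ops(k=300, n=1_000_000, d=10))
```

Note that the overhead term depends only on the number of passes, while the k*n*d^2 term is the part that shrinks as the cluster grows.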
Also, the size of the problem must affect sort I/O time (unless all jobs are map-only, but I don't think they can be). That cost is at least roughly proportional to the size of the input. So I guess problem size does matter, not just the number of slots available to the mappers.

On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix <jake.man...@gmail.com> wrote:
> Hmmm... that's a really tiny data set. Lanczos-based SVD, for k singular
> values, requires k passes over the data, and each row which has d non-zero
> entries will do d^2 computations in each pass. So if there are n rows in the
> data set, it's k*n*d^2 if all rows are the same size.
>
> I guess "how long" depends on how big the cluster is!
>
> On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>
>> Jake, since we are on the topic, what's the running times of Lanczos
>> on a ~1G worth sequence file input might be?
>>
>> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix <jake.man...@gmail.com>
>> wrote:
>> >
>> >
>> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>> > wrote:
>> >>
>> >> you can certainly try to write it out into a DRM (distributed row
>> >> matrix) and run stochastic SVD on hadoop (off the trunk now). see
>> >> MAHOUT-593. This is suitable if you have a good decay of singular
>> >> values (but if you don't it probably just means you have so much noise
>> >> that it masks the problem you are trying to solve in your data).
>> >
>> > You don't need to run it as stochastic, either. The regular
>> > LanczosSolver will work on this data, if it lives as a DRM.
>> >
>> > -jake