So, assuming 500 oversampled singular values is equivalent to perhaps 300
'good' values... depending on decay... would 300 singular values then
require 300 passes over the whole input, or only over a sub-part of it?
Given it takes about 20 s just to set up an MR run and 10 s to
confirm its completion, that's 300 x 30 s = 9,000 s, i.e. roughly 150
minutes, just in initialization time?
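
(A quick back-of-envelope check of that, in Java. The pass count and the
20 s / 10 s figures are just the assumptions from this thread, not measured
constants:

    public class MrOverheadEstimate {
        public static void main(String[] args) {
            int passes = 300;         // one MR job per Lanczos pass, per Jake's note below
            double setupSec = 20.0;   // assumed job setup time
            double confirmSec = 10.0; // assumed completion-confirmation time
            double totalSec = passes * (setupSec + confirmSec);
            System.out.printf("scheduling overhead: %.0f s (~%.0f min)%n",
                    totalSec, totalSec / 60.0);
            // -> 9000 s, i.e. ~150 minutes, before any actual compute or I/O
        }
    }
)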

Also, the size of the problem must affect sort/shuffle I/O time (unless
all jobs are map-only, but I don't think they can be). That cost is at
least proportional to the size of the input, so I guess problem size does
matter, not just the number of available slots for the mappers.
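
And for the compute side, Jake's k*n*d^2 estimate below can be sketched
the same way (n and d here are hypothetical values for a ~1G input, and
the op rate is an arbitrary assumption, just to get an order of magnitude):

    public class LanczosCostEstimate {
        public static void main(String[] args) {
            // cost model from Jake's reply below: k passes, each row with d
            // non-zeros costs d^2 operations, so total ~ k * n * d^2
            long k = 300;       // singular values requested
            long n = 1000000;   // hypothetical row count
            long d = 100;       // hypothetical non-zeros per row
            double ops = (double) k * n * d * d;  // = 3e12 here
            System.out.printf("~%.1e multiply-adds%n", ops);
            // at an assumed 1e10 ops/s aggregate across the cluster:
            System.out.printf("~%.0f s of pure compute%n", ops / 1e10);  // ~300 s
        }
    }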


On Wed, Apr 6, 2011 at 11:16 AM, Jake Mannix <jake.man...@gmail.com> wrote:
> Hmmm... that's a really tiny data set.  Lanczos-based SVD, for k singular
> values, requires k passes over the data, and each row which has d non-zero
> entries will do d^2 computations in each pass.  So if there are n rows
> in the data set, it's k*n*d^2 if all rows are the same size.
> I guess "how long" depends on how big the cluster is!
>
> On Wed, Apr 6, 2011 at 11:14 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>
>> Jake, since we are on the topic, what might the running time of Lanczos
>> on a ~1G sequence-file input be?
>>
>> On Wed, Apr 6, 2011 at 11:11 AM, Jake Mannix <jake.man...@gmail.com>
>> wrote:
>> >
>> >
>> > On Thu, Mar 24, 2011 at 11:03 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>> > wrote:
>> >>
>> you can certainly try to write it out into a DRM (distributed row
>> matrix) and run stochastic SVD on Hadoop (off the trunk now). See
>> MAHOUT-593. This is suitable if you have a good decay of singular
>> values (but if you don't, it probably just means you have so much noise
>> that it masks the problem you are trying to solve in your data).
>> >
>> > You don't need to run it as stochastic, either.  The regular
>> > LanczosSolver
>> > will work on this data, if it lives as a DRM.
>> >
>> >   -jake
>
>
