I don't agree that k>10 is unlikely to be meaningful. I've used SVD in text
mining problems where k~150 yielded the best results. That was not only a
good choice based on plotting the eigenvalues and seeing that the elbow in
the decay was near 150; I also checked results with different values of k,
and those around 150 made much more sense. Currently I'm working on a
recommender system and already have Lanczos running with k~50 producing the
best results, again based on visual exploration of the eigenvalues and on
exploring results one by one and seeing that they were more meaningful. My
current tests with SSVD are based on the latter problem. When I say I'm not
getting good results, I mean that Lanczos is working properly on the same
problem (I've explored eigenvalues up to 150 and they have a good decay)
and SSVD is not. As I said, this might be caused by some bug in the input
process. It seems too strange to me that the results are so different, so
I'll get back to this discussion when I figure it out :) . If you are
curious about the numbers: 1MM rows by 150k columns for the text mining case
and 18MM rows by 80k columns for the recommender.
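
As a minimal sketch of the spectrum-based part of that choice (not the code
used in these projects), the snippet below picks the smallest k whose leading
singular values retain a given fraction of the total squared spectrum. The
function name `smallest_k_for_variance`, the 90% threshold and the two
synthetic spectra are assumptions for illustration only; it does not replace
checking the actual results for different values of k.

```python
import numpy as np


def smallest_k_for_variance(singular_values, fraction):
    """Return the smallest k whose top-k squared singular values
    account for at least `fraction` of the total (hypothetical helper)."""
    # Sort in descending order so the cumulative sum follows the spectrum.
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    cumulative = np.cumsum(s ** 2)
    # Normalise by the last entry so the final value is exactly 1.0.
    energy = cumulative / cumulative[-1]
    # First index where the retained fraction reaches the threshold.
    return int(np.searchsorted(energy, fraction)) + 1


# Synthetic spectra (assumptions, not data from the thread).
decaying = 0.5 ** np.arange(100)   # fast geometric decay
flat = np.ones(100)                # no decay at all

print("decaying spectrum:", smallest_k_for_variance(decaying, 0.9))
print("flat spectrum:    ", smallest_k_for_variance(flat, 0.9))
```

With a fast decay a handful of components is enough, while a flat spectrum
needs k close to the full rank for the same fraction.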

About p and q, I have been playing around with the MovieLens 100k dataset and
found that q>0 actually worsens results in terms of precision (nothing
severe, but it happens) and that it's better to increase p a little in that
particular case. My guess is that it depends a lot on the dataset, though I
don't know how.
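
For experimenting with p and q on a small matrix, a randomized SVD in the
style described by Halko et al. can be sketched in NumPy. This is an
illustrative assumption, not Mahout's SSVD implementation: the names
`randomized_svd` and `relative_error`, the synthetic matrix and the seeds are
all hypothetical, and k + p is assumed to be no larger than min(rows, columns).

```python
import numpy as np


def randomized_svd(A, k, p=15, q=0, seed=0):
    """Rank-k randomized SVD with oversampling p and q power iterations."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # Random test matrix with k + p columns (oversampling).
    omega = rng.standard_normal((n, k + p))
    # Orthonormal basis for the sampled range of A.
    Q, _ = np.linalg.qr(A @ omega)
    # Optional power iterations, re-orthonormalised at each step.
    for _ in range(q):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Project A onto the basis and take the exact SVD of the small matrix.
    B = Q.T @ A
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_b)[:, :k], s[:k], Vt[:k]


def relative_error(estimated, exact):
    """Relative 2-norm error between two vectors of singular values."""
    return np.linalg.norm(estimated - exact) / np.linalg.norm(exact)


# Synthetic matrix with decaying column scales (an assumption, not real data).
rng = np.random.default_rng(42)
A = rng.standard_normal((300, 120)) * (1.0 / np.arange(1, 121))
k = 20
exact = np.linalg.svd(A, compute_uv=False)[:k]

for p in (5, 15):
    for q in (0, 1):
        _, s, _ = randomized_svd(A, k, p=p, q=q)
        print(f"p={p:2d} q={q} relative error={relative_error(s, exact):.2e}")
```

This only measures how well the leading singular values are approximated; a
precision metric on a recommender task would have to be evaluated separately
on the dataset in question.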


2013/8/2 Dmitriy Lyubimov <dlie...@gmail.com>

> The only time you would not get good results is if the spectrum does not
> have a good decay, which is equivalent to having mostly the same variance
> in most of the original basis directions. This problem is similar to the
> one that arises with PCA when you try to do dimensionality reduction while
> retaining a certain percentage of variance. In the case of a flat spectrum
> decay, you'd need a much bigger k to retain the same amount of variance in
> the dimensionally reduced projection. In that sense the SSVD solution for a
> given k is as good as PCA gets for the same k. Also, I believe (but am not
> 100% sure) that "problems too small" exhibit higher errors due to the law
> of large numbers.
>
>
> On Fri, Aug 2, 2013 at 10:41 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > If you use k > 40 you are already beating Lanczos for larger datasets.
> > k>10 is unlikely to be meaningful. p need not be more than 15% of k (the
> > default is 15). Use q=1; q>1 does not yield tangible improvements in the
> > real world. Again, see Nathan Halko's dissertation for the accuracy
> > comparison.
> >
> >
> >
> > On Fri, Aug 2, 2013 at 4:17 AM, Fernando Fernández <
> > fernando.fernandez.gonza...@gmail.com> wrote:
> >
> >> Keeping Lanczos would be nice. Like I said, it's currently being used in
> >> some projects with good results, and I think it's easier to tune, so it
> >> would be my first choice for future developments. I still need to test
> >> SSVD further, especially because in the example I'm currently working on
> >> it yields very different results from Lanczos. We are investigating
> >> whether this is due to a bug when loading the data (though the dimensions
> >> of the output seem OK) or whether it's a question of increasing the p or
> >> q parameters. If it's a question of increasing p and q, I think the
> >> running times would make SSVD not viable. I hope to be able to provide
> >> some comparison figures in terms of precision and running time in a month
> >> or so.
> >>
> >> I hope that other users read this and say whether they are using Lanczos.
> >>
> >> Best,
> >> Fernando.
> >>
> >> 2013/8/2 Sebastian Schelter <s...@apache.org>
> >>
> >> > I would also be fine with keeping it if there is demand. I just
> >> > proposed to deprecate it and nobody voted against that at that point in
> >> > time.
> >> >
> >> > --sebastian
> >> >
> >> >
> >> > On 02.08.2013 03:12, Dmitriy Lyubimov wrote:
> >> > > There's a part of Nathan Halko's dissertation, referenced on the
> >> > > algorithm page, that runs a comparison. In particular, he was not
> >> > > able to compute more than 40 eigenvectors with Lanczos on the
> >> > > Wikipedia dataset. You may refer to that study.
> >> > >
> >> > > On the accuracy part, it was not observed to be a problem, assuming
> >> > > a high level of random noise is not the case, at least not in the
> >> > > LSA-like application used there.
> >> > >
> >> > > That said, I am all for diversity of tools. I would actually be +0 on
> >> > > deprecating Lanczos; it is not like we are lacking support for it.
> >> > > SSVD could use improvements too.
> >> > >
> >> > >
> >> > > On Thu, Aug 1, 2013 at 3:15 AM, Fernando Fernández <
> >> > > fernando.fernandez.gonza...@gmail.com> wrote:
> >> > >
> >> > >> Hi everyone,
> >> > >>
> >> > >> Sorry if I duplicate the question, but I've been looking for an
> >> > >> answer and I haven't found an explanation other than that it's not
> >> > >> being used (together with some other algorithms). If it's been
> >> > >> discussed in depth before, maybe you can point me to a link with the
> >> > >> discussion.
> >> > >>
> >> > >> I have successfully used Lanczos in several projects, and it's been a
> >> > >> surprise to me to find that the main reason (according to what I've
> >> > >> read, which might not be the full story) is that it's not being used.
> >> > >> At the beginning I supposed it was because SSVD is supposed to be
> >> > >> much faster with similar results, but after running some tests I have
> >> > >> found that running times are similar to or even worse than Lanczos
> >> > >> for some configurations. I have tried several combinations of
> >> > >> parameters, given child processes enough memory, etc., and had no
> >> > >> success in running SSVD in at least 3/4 of the time Lanczos takes,
> >> > >> though there might be some combinations of parameters I have still
> >> > >> not tried. It seems to be quite tricky to find a good combination of
> >> > >> parameters for SSVD, and I have also seen a precision loss in some
> >> > >> examples that makes me not confident in migrating from Lanczos to
> >> > >> SSVD from now on. (How far can I trust results from a combination of
> >> > >> parameters that runs in significantly less time, or at least a good
> >> > >> time?)
> >> > >>
> >> > >> Can someone convince me that SSVD is actually a better option than
> >> > >> Lanczos? (I'm totally willing to be convinced... :) )
> >> > >>
> >> > >> Thank you very much in advance.
> >> > >>
> >> > >> Fernando.
> >> > >>
> >> > >
> >> >
> >> >
> >>
> >
> >
>
