For billions of rows, you can do block-wise QR and get the SVD pretty
easily.
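
A minimal in-memory sketch of the block-wise QR idea, using mahout-math's
QRDecomposition and SingularValueDecomposition (the class BlockwiseQrSvd and
the splitting of A into row blocks are made up here; distributing the
per-block QRs across mappers is left out):

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.QRDecomposition;
import org.apache.mahout.math.SingularValueDecomposition;

public class BlockwiseQrSvd {

  // blocks[i] is one (rows_i x k) row-slice of the tall n x k matrix A.
  public static SingularValueDecomposition rightSvd(Matrix[] blocks, int k) {
    // QR-factor each block independently and stack the small k x k
    // R factors vertically into a (numBlocks * k) x k matrix.
    Matrix stackedR = new DenseMatrix(blocks.length * k, k);
    for (int i = 0; i < blocks.length; i++) {
      Matrix r = new QRDecomposition(blocks[i]).getR();
      for (int row = 0; row < k; row++) {
        stackedR.assignRow(i * k + row, r.viewRow(row));
      }
    }
    // One more QR reduces the stack back to k x k; that R has the same
    // singular values and right singular vectors as A itself, and k is
    // small, so the final SVD is a trivial in-memory computation.
    Matrix finalR = new QRDecomposition(stackedR).getR();
    return new SingularValueDecomposition(finalR);
  }
}

The left singular vectors, if needed, come from one more pass over A that
multiplies each row by V * diag(1/sigma).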

Also, the distributed matrix times() will get you there, with slightly less
numerical stability.
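
For the distributed route, a sketch of the call (DistributedRowMatrix lives
in org.apache.mahout.math.hadoop; the paths and dimensions below are
placeholders, and my recollection is that times() sums row-wise outer
products, so a.times(a) comes out as A^T * A):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.math.hadoop.DistributedRowMatrix;

public class GramJob {
  public static void main(String[] args) throws Exception {
    DistributedRowMatrix a = new DistributedRowMatrix(
        new Path("/data/A"),        // SequenceFile of VectorWritable rows
        new Path("/tmp/drm-work"),  // scratch space for the job
        50000000,                   // numRows: millions of rows is fine
        5000);                      // numCols: tall and dense
    a.setConf(new Configuration());
    // Launches the matrix-multiplication MapReduce job; the result should
    // be the 5000 x 5000 Gram matrix A^T * A, still stored on HDFS.
    DistributedRowMatrix gram = a.times(a);
  }
}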

On Thu, Jun 23, 2011 at 12:53 PM, Jake Mannix <jake.man...@gmail.com> wrote:

> Well I'm going to pretend for a second that you're talking about *billions*
> of rows, because tens of millions of rows still fit not just on one box, but
> on one box *in memory*... but yes, the MapReduce job you want already exists
> in Mahout: DistributedRowMatrix#times(DistributedRowMatrix) does exactly this.
>
> Although it doesn't do the mean-subtraction.
>
>  -jake
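
The mean-subtraction Jake mentions does not require materializing a centered
copy of A, though: one extra pass to accumulate the column-mean vector m of
the n-row matrix A is enough, since

  cov(A) = (1/n) * A^T A - m m^T

(or (A^T A - n * m m^T) / (n - 1) for the sample covariance), so the
centering can be applied to the small output of the distributed multiply
after the fact.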
>
> 2011/6/23 <tr...@cs.drexel.edu>
>
> > Yes, but with millions of rows an M/R job would be required to create the
> > covariance matrix, unless I am missing something.
> >
> > > Doing PCA on a 5000 x 5000 matrix is still an in-memory thing.  That's
> > > only 25M doubles, 200MB of memory.  Lots of techniques can run on that
> > > set.  You could do Lanczos on it with whatever rank you want (but don't
> > > worry about distributed Lanczos).
> > >
> > > 2011/6/23 <tr...@cs.drexel.edu>
> > >
> > >> I will; if it works I may have to make an M/R job for it. All the data
> > >> we have will be tall and dense (let's say 5000 columns, with millions
> > >> of rows). Doing PCA on that will create a covariance matrix that is
> > >> square and dense. Thanks again, guys.
> > >>
> > >> -Trevor
> > >>
> > >> > Try the QR trick.  It is amazingly effective.
> > >> >
> > >> > 2011/6/23 <tr...@cs.drexel.edu>
> > >> >
> > >> >> Alright, thanks guys.
> > >> >>
> > >> >> > The cases where Lanczos or the stochastic projection helps are
> > >> >> > cases where you have *many* columns but where the data are sparse.
> > >> >> > If you have a very tall dense matrix, the QR method is much
> > >> >> > preferred.
> > >> >> >
> > >> >> > 2011/6/23 <tr...@cs.drexel.edu>
> > >> >> >
> > >> >> >> Ok, then what would you think to be the minimum number of columns
> > >> >> >> in the dataset for Lanczos to give a reasonable result?
> > >> >> >>
> > >> >> >> Thanks,
> > >> >> >> -Trevor
> > >> >> >>
> > >> >> >> > A gazillion rows of 2-column data is really much better suited
> > >> >> >> > to doing the following:
> > >> >> >> >
> > >> >> >> > If each row is of the form [a, b], then compute the matrix
> > >> >> >> >
> > >> >> >> > [[a*a, a*b], [a*b, b*b]]
> > >> >> >> >
> > >> >> >> > (the outer product of the vector with itself).
> > >> >> >> >
> > >> >> >> > Then take the matrix sum of all of these, one from each row of
> > >> >> >> > your input matrix.
> > >> >> >> >
> > >> >> >> > You'll now have a 2x2 matrix, which you can diagonalize by hand.
> > >> >> >> > It will give you your eigenvalues, and also the right-singular
> > >> >> >> > vectors of your original matrix.
> > >> >> >> >
> > >> >> >> >   -jake
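
A minimal in-memory sketch of that recipe (plain Java plus mahout-math's
DenseMatrix; TwoColumnPca and its method names are invented, and a double[][]
stands in for the stream of 50M rows):

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

public class TwoColumnPca {

  // Sum of the outer products [a,b]^T [a,b] over all rows, i.e. A^T A.
  public static Matrix gram(double[][] rows) {
    double aa = 0.0, ab = 0.0, bb = 0.0;
    for (double[] r : rows) {
      aa += r[0] * r[0];
      ab += r[0] * r[1];
      bb += r[1] * r[1];
    }
    return new DenseMatrix(new double[][] {{aa, ab}, {ab, bb}});
  }

  // Closed-form eigenvalues of the symmetric 2x2 [[p, q], [q, s]],
  // returned in descending order.
  public static double[] eigenvalues(Matrix m) {
    double p = m.get(0, 0), q = m.get(0, 1), s = m.get(1, 1);
    double mid = (p + s) / 2.0;
    double disc = Math.sqrt((p - s) * (p - s) / 4.0 + q * q);
    return new double[] {mid + disc, mid - disc};
  }

  // Unnormalized eigenvector for eigenvalue lambda (valid when q != 0).
  public static double[] eigenvector(Matrix m, double lambda) {
    return new double[] {m.get(0, 1), lambda - m.get(0, 0)};
  }
}

The eigenvalues of this 2x2 are those of A^T A, i.e. the squares of the
singular values of the original tall matrix, and its eigenvectors are the
right-singular vectors.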
> > >> >> >> >
> > >> >> >> > 2011/6/23 <tr...@cs.drexel.edu>
> > >> >> >> >
> > >> >> >> >> Yes, exactly why I asked it for only 2 eigenvalues. So what is
> > >> >> >> >> being said is: if I have, let's say, 50M rows of 2-column data,
> > >> >> >> >> Lanczos can't do anything with it (assuming it puts the 0
> > >> >> >> >> eigenvalue in the mix - of the 2 eigenvectors only 1 is
> > >> >> >> >> returned because the 0 eigenvalue takes up a slot)?
> > >> >> >> >>
> > >> >> >> >> If the eigenvalue of 0 is invalid, then should it not be
> > >> >> >> >> filtered out, so that "rank" potentially valid eigenvalues are
> > >> >> >> >> returned?
> > >> >> >> >>
> > >> >> >> >> -Trevor
> > >> >> >> >>
> > >> >> >> >> > Ah, if your matrix only has 2 columns, you can't go to rank
> > >> >> >> >> > 10.  Try on some slightly less synthetic data of more than
> > >> >> >> >> > rank 10.  You can't ask Lanczos for more reduced rank than
> > >> >> >> >> > that of the matrix itself.
> > >> >> >> >> >
> > >> >> >> >> >   -jake
> > >> >> >> >> >
> > >> >> >> >> > 2011/6/23 <tr...@cs.drexel.edu>
> > >> >> >> >> >
> > >> >> >> >> >> Alright, I can reorder, that is easy; I just had to verify
> > >> >> >> >> >> that the ordering was correct. But when I increased the rank
> > >> >> >> >> >> I got Lanczos bailing out, which incidentally causes a
> > >> >> >> >> >> NullPointerException:
> > >> >> >> >> >>
> > >> >> >> >> >> INFO: 9 passes through the corpus so far...
> > >> >> >> >> >> WARNING: Lanczos parameters out of range: alpha = NaN,
> > >> >> >> >> >> beta = NaN. Bailing out early!
> > >> >> >> >> >> INFO: Lanczos iteration complete - now to diagonalize the
> > >> >> >> >> >> tri-diagonal auxiliary matrix.
> > >> >> >> >> >> Exception in thread "main" java.lang.NullPointerException
> > >> >> >> >> >>         at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
> > >> >> >> >> >>         at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
> > >> >> >> >> >>         at pca.PCASolver.solve(PCASolver.java:53)
> > >> >> >> >> >>         at pca.PCA.main(PCA.java:20)
> > >> >> >> >> >>
> > >> >> >> >> >> I should probably note that my data only has 2 columns; the
> > >> >> >> >> >> real data will have quite a bit more.
> > >> >> >> >> >>
> > >> >> >> >> >> The failure happens with a rank of 10 or more, with the
> > >> >> >> >> >> last, and therefore most significant, eigenvector being
> > >> >> >> >> >> <NaN,NaN>.
> > >> >> >> >> >>
> > >> >> >> >> >> -Trevor
> > >> >> >> >> >> > The 0 eigenvalue output is not valid, and yes, the output
> > >> >> >> >> >> > will list the results in *increasing* order, even though
> > >> >> >> >> >> > it is finding the largest eigenvalues/vectors first.
> > >> >> >> >> >> >
> > >> >> >> >> >> > Remember that convergence is gradual, so if you only ask
> > >> >> >> >> >> > for 3 eigenvectors/values, you won't be very accurate.  If
> > >> >> >> >> >> > you ask for 10 or more, the largest few will now be quite
> > >> >> >> >> >> > good.  If you ask for 50, now the top 10-20 will be
> > >> >> >> >> >> > *extremely* accurate, and maybe the top 30 will still be
> > >> >> >> >> >> > quite good.
> > >> >> >> >> >> >
> > >> >> >> >> >> > Try out a non-distributed form of what is in the
> > >> >> >> >> >> > EigenVerificationJob to re-order the output and collect
> > >> >> >> >> >> > how accurate your results are (it computes errors for you
> > >> >> >> >> >> > as well).
> > >> >> >> >> >> >
> > >> >> >> >> >> >   -jake
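
A minimal non-distributed version of that accuracy check might look as
follows (EigenCheck and residual are names invented here; only the
Matrix/Vector calls are mahout-math API):

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

public class EigenCheck {

  // Residual ||M v - lambda v|| / ||v||: near zero for a converged
  // eigenpair, large for junk such as the spurious 0.0-eigenvalue slot.
  public static double residual(Matrix m, Vector v, double lambda) {
    Vector mv = m.times(v);
    return mv.minus(v.times(lambda)).norm(2) / v.norm(2);
  }
}

Sorting the pairs by descending eigenvalue and dropping any pair with a large
residual reproduces the re-ordering and filtering behavior discussed above.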
> > >> >> >> >> >> >
> > >> >> >> >> >> > 2011/6/23 <tr...@cs.drexel.edu>
> > >> >> >> >> >> >
> > >> >> >> >> >> >> So, I know that MAHOUT-369 fixed a bug in the distributed
> > >> >> >> >> >> >> version of the LanczosSolver, but I am experiencing a
> > >> >> >> >> >> >> similar problem with the non-distributed version.
> > >> >> >> >> >> >>
> > >> >> >> >> >> >> I send in a dataset of Gaussian-distributed numbers
> > >> >> >> >> >> >> (testing PCA stuff) and my eigenvalues are seemingly
> > >> >> >> >> >> >> reversed. Below is the output given in the logs by
> > >> >> >> >> >> >> LanczosSolver.
> > >> >> >> >> >> >>
> > >> >> >> >> >> >> Output:
> > >> >> >> >> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
> > >> >> >> >> >> >> INFO: Eigenvector 1 found with eigenvalue 347.8703086831804
> > >> >> >> >> >> >> INFO: LanczosSolver finished.
> > >> >> >> >> >> >>
> > >> >> >> >> >> >> So it returns a vector with eigenvalue 0 before one with
> > >> >> >> >> >> >> an eigenvalue of 347? What's more interesting is that when
> > >> >> >> >> >> >> I increase the rank, I get a new eigenvector with a value
> > >> >> >> >> >> >> between 0 and 347:
> > >> >> >> >> >> >>
> > >> >> >> >> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
> > >> >> >> >> >> >> INFO: Eigenvector 1 found with eigenvalue 44.794928654801566
> > >> >> >> >> >> >> INFO: Eigenvector 2 found with eigenvalue 347.8286920203704
> > >> >> >> >> >> >>
> > >> >> >> >> >> >> Shouldn't the eigenvalues be in descending order? Also, is
> > >> >> >> >> >> >> the 0.0 eigenvalue even valid?
> > >> >> >> >> >> >>
> > >> >> >> >> >> >> Thanks,
> > >> >> >> >> >> >> Trevor