Grant, Akshay has it right:

  If your input vectors (N of them) have an average of "d" nonzero
entries each, then the size of your input is N * d * 12 bytes (in our case,
with int keys and double values).  The output is the left singular vectors,
which are k *dense* vectors of size M, where M is your row-size (for text:
the size of your dictionary), so the output is k * M * 8 bytes (dense means
you don't need to store the keys).  If you want to project the original
inputs onto the latent factor vectors, the size of that will be
k * N * 8 bytes.
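
  As a rough check, the formulas above can be plugged into a few lines of
Python.  This is only a sketch: the helper names are made up for
illustration, the values of N, M and k are read off the --numRows,
--numCols and --rank flags in Grant's command quoted below, and any
on-disk file overhead is ignored.

```python
# Back-of-the-envelope size estimates; the helper names are hypothetical.
SPARSE_ENTRY_BYTES = 12  # int key (4 bytes) + double value (8 bytes)
DENSE_ENTRY_BYTES = 8    # double value only, no key needed


def sparse_input_bytes(n, d):
    """N sparse input vectors with d nonzero entries each on average."""
    return n * d * SPARSE_ENTRY_BYTES


def singular_vectors_bytes(k, m):
    """k dense singular vectors, each of size M."""
    return k * m * DENSE_ENTRY_BYTES


def projection_bytes(k, n):
    """N input vectors projected onto the k latent factors."""
    return k * n * DENSE_ENTRY_BYTES


# Assumed from the quoted command: --numRows, --numCols, --rank.
N, M, K = 130103, 65458, 200

basis = singular_vectors_bytes(K, M)
projected = projection_bytes(K, N)
print(basis, projected)
```

  With these numbers the k dense vectors come to roughly 104.7 million
bytes, which is in line with the 104 MB Grant reports for svdOut.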

  So, comparing input to output, it's N * d vs. N * k.  These could easily
be of the same order of size, unless k (the reduced rank) is small, or d
(the document size, roughly) is large (more than a couple hundred or a
thousand unique terms per document).
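
  To make the N * d vs. N * k comparison concrete, here is a small sketch
that includes the per-entry byte counts from above (12 for a sparse entry,
8 for a dense one).  The function names are hypothetical, and d = 100 is
just an illustrative document size, not a measured one.

```python
def projected_to_input_ratio(k, d, dense_bytes=8, sparse_bytes=12):
    """Size of the rank-k projection relative to the sparse input.

    N cancels out: (N * k * dense_bytes) / (N * d * sparse_bytes).
    """
    return (k * dense_bytes) / (d * sparse_bytes)


def break_even_rank(d, dense_bytes=8, sparse_bytes=12):
    """Rank at which the projection is exactly as large as the input."""
    return d * sparse_bytes / dense_bytes


d = 100  # illustrative average number of unique terms per document
for k in (20, 150, 200):
    print(k, projected_to_input_ratio(k, d))
print(break_even_rank(d))
```

  At d = 100, the projected output only becomes smaller than the input
once k drops below 150.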

  In short: SVD should not be thought of as "compression", in most cases.
Reduced dimensionality means a smaller basis you can use, but it's dense
now, so documents don't necessarily get "reduced".  In fact, projecting
individual terms onto the SVD basis *inflates* them from size O(1) to size
O(k).
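
  A toy illustration of that last point, assuming a made-up orthonormal
basis with k = 3 and M = 4 (not real SVD output) and a hypothetical
project() helper: a single term is one sparse entry going in and k dense
values coming out.

```python
def project(sparse_vec, basis):
    """Project a sparse vector (dict of index -> value) onto each dense
    basis vector by taking dot products over the nonzero entries only."""
    return [sum(value * b[i] for i, value in sparse_vec.items())
            for b in basis]


# Made-up orthonormal basis: k = 3 dense vectors of size M = 4.
basis = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
]

term = {2: 1.0}  # a single term: one stored entry
projected = project(term, basis)
print(len(term), len(projected), projected)
```

  The one-entry term becomes [0.5, 0.5, -0.5]: three stored values instead
of one.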

  -jake

On Sun, Aug 29, 2010 at 3:29 PM, Akshay Bhat <[email protected]> wrote:

> Even though the SVD is supposed to reduce dimensionality, that does not mean
> your results will be smaller [in terms of memory], since U, S and V are
> dense matrices, unless you are keeping only a few eigenvectors. Your input
> matrix is sparse; had it been represented as a dense matrix, it would be
> far larger.
>
>
> On Sun, Aug 29, 2010 at 5:13 PM, Grant Ingersoll <[email protected]
> >wrote:
>
> > It should be noted that cranking the rank down to 20 produces a
> > significantly smaller result.
> >
> >
> > On Aug 29, 2010, at 4:38 PM, Grant Ingersoll wrote:
> >
> > > I'm running SVD as:
> > > ./mahout svd --input /tmp/solr-clust-n2/part-out.vec \
> > >   --tempDir /tmp/solr-clust-n2/svdTemp --output /tmp/solr-clust-n2/svdOut \
> > >   --rank 200 --numCols 65458 --numRows 130103
> > > ./mahout cleansvd --eigenInput /tmp/solr-clust-n2/svdOut \
> > >   --corpusInput /tmp/solr-clust-n2/part-out.vec \
> > >   --output /tmp/solr-clust-n2/svdFinal --maxError 0.1 --minEigenvalue 10.0
> > >
> > > part-out.vec is 52 MB.  The output from SVD (svdOut) is 104 MB and
> > > largestCleanEigens is 88 MB.  For some reason, this really doesn't feel
> > > right.
> > >
> > > Is there a guide on interpreting the output of SVD anywhere?
> > > Intuitively, I believe the output should be a lot smaller?  I mean that's
> > > the point, right?
> > >
> > > I can share the vector if you want.
> > >
> > > -Grant
> > >
> > > --------------------------
> > > Grant Ingersoll
> > > http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
> > >
> >
> > --------------------------
> > Grant Ingersoll
> > http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
> >
> >
>
>
> --
> Akshay Uday Bhat.
> Graduate Student, Computer Science, Cornell University
> Website: http://www.akshaybhat.com
>
