Grant, Akshay has it right: if your N input vectors have an average of d nonzero entries each, then the size of your input is N * d * 12 bytes (in our case, with 4-byte int keys and 8-byte double values). The output is the left singular vectors, which are k *dense* vectors of size M, where M is the length of a row (for text: the size of your dictionary), so the output takes k * M * 8 bytes (dense means you don't need to store the keys). If you want to project the original inputs onto the latent factor vectors, the result takes k * N * 8 bytes.
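To put rough numbers on the formulas above, here is a small back-of-the-envelope sketch in Python. The function names are made up for illustration and are not part of Mahout. It assumes exactly 12 bytes per sparse entry and 8 bytes per dense entry, and ignores any file-format overhead. N and M are taken from the --numRows and --numCols values in Grant's command quoted below.

```python
# Rough size estimates only; names are hypothetical, not Mahout APIs.
BYTES_PER_SPARSE_ENTRY = 12  # 4-byte int key + 8-byte double value
BYTES_PER_DENSE_ENTRY = 8    # 8-byte double, no key stored


def sparse_input_bytes(n_rows, avg_nonzeros):
    """Size of N sparse input vectors with d nonzeros each: N * d * 12."""
    return n_rows * avg_nonzeros * BYTES_PER_SPARSE_ENTRY


def dense_basis_bytes(rank, n_cols):
    """Size of k dense singular vectors of length M: k * M * 8."""
    return rank * n_cols * BYTES_PER_DENSE_ENTRY


def projected_rows_bytes(rank, n_rows):
    """Size of N inputs projected onto k latent factors: k * N * 8."""
    return rank * n_rows * BYTES_PER_DENSE_ENTRY


n_rows, n_cols = 130103, 65458  # --numRows and --numCols from the thread

for rank in (200, 20):
    basis = dense_basis_bytes(rank, n_cols)
    projected = projected_rows_bytes(rank, n_rows)
    print(f"rank={rank}: basis={basis / 1e6:.1f} MB, "
          f"projected inputs={projected / 1e6:.1f} MB")
```

At rank 200 the basis estimate comes to about 104.7 MB (taking 1 MB as 10^6 bytes), which is in line with the 104 MB Grant reports for svdOut. At rank 20 it drops by a factor of ten.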
So in general, comparing the input to the projected output, it's N * d vs. N * k. These can be of the same order of size unless k (the reduced rank) is small or d (roughly, the document size) is large (more than a couple hundred or a thousand unique terms per document).

In short: SVD should not be thought of as "compression" in most cases. Reduced dimensionality means you get a smaller basis to work with, but the vectors are dense now, so documents don't necessarily get "reduced". In fact, projecting individual terms onto the SVD basis *inflates* them from size O(1) to size O(k).

  -jake

On Sun, Aug 29, 2010 at 3:29 PM, Akshay Bhat wrote:

> Even though the SVD is supposed to reduce dimensionality, it does not mean
> that your results will have a smaller size [in terms of memory], since U, S
> and V are dense matrices, unless you are using very few eigenvectors.
> Your input matrix is sparse; had it been represented as a dense matrix it
> would have a far larger size.
>
> On Sun, Aug 29, 2010 at 5:13 PM, Grant Ingersoll wrote:
>
> > Should be noted that cranking the rank down to 20 produces a
> > significantly smaller result.
> >
> > On Aug 29, 2010, at 4:38 PM, Grant Ingersoll wrote:
> >
> > > I'm running SVD as:
> > > ./mahout svd --input /tmp/solr-clust-n2/part-out.vec --tempDir /tmp/solr-clust-n2/svdTemp --output /tmp/solr-clust-n2/svdOut --rank 200 --numCols 65458 --numRows 130103
> > > ./mahout cleansvd --eigenInput /tmp/solr-clust-n2/svdOut --corpusInput /tmp/solr-clust-n2/part-out.vec --output /tmp/solr-clust-n2/svdFinal --maxError 0.1 --minEigenvalue 10.0
> > >
> > > part-out.vec is 52 MB. The output from SVD (svdOut) is 104 MB and
> > > largestCleanEigens is 88 MB. For some reason, this really doesn't feel
> > > right.
> > >
> > > Is there a guide on interpreting the output of SVD anywhere?
> > > Intuitively, I believe the output should be a lot smaller? I mean
> > > that's the point, right?
> > >
> > > I can share the vector if you want.
> > >
> > > -Grant
> > >
> > > --------------------------
> > > Grant Ingersoll
> > > http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
> >
> > --------------------------
> > Grant Ingersoll
> > http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>
> --
> Akshay Uday Bhat.
> Graduate Student, Computer Science, Cornell University
> Website: http://www.akshaybhat.com
