Thanks to all for the explanations.
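To check my own understanding of the arithmetic Ted describes below, here is a minimal sketch in Python. The function names are mine, it ignores any per-vector or file-format overhead, and it assumes (not confirmed anywhere in this thread) that the SVD output holds one dense vector of numCols doubles per requested rank.

```python
# Storage costs as stated in the thread: a sparse entry needs an int index
# (4 bytes) plus a double value (8 bytes); a dense entry needs only the double.
INT_BYTES = 4
DOUBLE_BYTES = 8


def sparse_row_bytes(nonzeros: int) -> int:
    """Bytes for one sparse row holding `nonzeros` (index, value) pairs."""
    return nonzeros * (INT_BYTES + DOUBLE_BYTES)


def dense_row_bytes(rank: int) -> int:
    """Bytes for one dense row with `rank` double entries."""
    return rank * DOUBLE_BYTES


def break_even_nonzeros(rank: int) -> float:
    """Average non-zeros per row below which the sparse input is smaller
    than a dense row of length `rank`."""
    return dense_row_bytes(rank) / (INT_BYTES + DOUBLE_BYTES)


def dense_eigenvector_bytes(rank: int, num_cols: int) -> int:
    """Assumption: `rank` dense eigenvectors, each with `num_cols` doubles."""
    return rank * num_cols * DOUBLE_BYTES


if __name__ == "__main__":
    # Values taken from the svd command line in the original message.
    num_cols = 65458
    for rank in (200, 20):
        print(f"rank={rank}: "
              f"break-even ~{break_even_nonzeros(rank):.1f} non-zeros/row, "
              f"dense eigenvectors ~{dense_eigenvector_bytes(rank, num_cols) / 1e6:.1f} MB")
```

Under that assumption, rank 200 with 65458 columns comes to 200 * 65458 * 8 = 104,732,800 bytes, which is in line with the 104 MB I saw for svdOut, and rank 20 is a tenth of that.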

On Aug 29, 2010, at 7:49 PM, Ted Dunning wrote:

> Like Jake said.
>
> On Sun, Aug 29, 2010 at 4:48 PM, Ted Dunning <[email protected]> wrote:
>
>> In particular, since our sparse representation requires an int (4 bytes)
>> and a double (8 bytes) to store one non-zero entry, while a dense row
>> requires only 8 bytes per entry, your original data would require less
>> storage if it has fewer than 200 * 8 / 12 = 133 non-zero entries per
>> row on average. Depending on the data set, this could be very likely or
>> totally implausible.
>>
>> SVD is still useful in these cases because it can provide useful
>> smoothing.
>>
>> On Sun, Aug 29, 2010 at 3:29 PM, Akshay Bhat <[email protected]> wrote:
>>
>>> Even though the SVD is supposed to reduce dimensionality, that does
>>> not mean your results will have a smaller size [in terms of memory],
>>> since U, S and V are dense matrices, unless you are using only a few
>>> eigenvectors. Your input matrix is sparse; had it been represented as
>>> a dense matrix, it would have a far larger size.
>>>
>>> On Sun, Aug 29, 2010 at 5:13 PM, Grant Ingersoll <[email protected]> wrote:
>>>
>>>> Should be noted that cranking the rank down to 20 produces a
>>>> significantly smaller result.
>>>>
>>>> On Aug 29, 2010, at 4:38 PM, Grant Ingersoll wrote:
>>>>
>>>>> I'm running SVD as:
>>>>>
>>>>> ./mahout svd --input /tmp/solr-clust-n2/part-out.vec \
>>>>>     --tempDir /tmp/solr-clust-n2/svdTemp \
>>>>>     --output /tmp/solr-clust-n2/svdOut --rank 200 \
>>>>>     --numCols 65458 --numRows 130103
>>>>>
>>>>> ./mahout cleansvd --eigenInput /tmp/solr-clust-n2/svdOut \
>>>>>     --corpusInput /tmp/solr-clust-n2/part-out.vec \
>>>>>     --output /tmp/solr-clust-n2/svdFinal \
>>>>>     --maxError 0.1 --minEigenvalue 10.0
>>>>>
>>>>> part-out.vec is 52 MB. The output from SVD (svdOut) is 104 MB and
>>>>> largestCleanEigens is 88 MB. For some reason, this really doesn't
>>>>> feel right.
>>>>>
>>>>> Is there a guide on interpreting the output of SVD anywhere?
>>>>> Intuitively, I believe the output should be a lot smaller? I mean
>>>>> that's the point, right?
>>>>>
>>>>> I can share the vector if you want.
>>>>>
>>>>> -Grant
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
>>>
>>> --
>>> Akshay Uday Bhat.
>>> Graduate Student, Computer Science, Cornell University
>>> Website: http://www.akshaybhat.com

--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
