Before I rewrite my program: is there any advantage over the Lanczos SVD?

2011/6/7 Dmitriy Lyubimov <dlie...@gmail.com>:
> I am saying I did not test it with 0.20.2.
>
> Yes, it is integrated in the 0.5 release, but there might be problems with
> Hadoop 0.20.2.
>
> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>> Hmm... looks nice...
>>
>> So there is a Lanczos implementation of SVD and the stochastic version
>> of SVD. Both produce the doc-concept vectors that I need.
>> I actually get my TF-IDF vectors directly from a Lucene index (and have
>> to do some magic to get IntWritable, VectorWritable).
>>
>> Still, I do not exactly understand what you are trying to say. Your SSVD
>> does not run with Mahout but on CDH (what is this, btw?)? Or is it
>> available for Mahout? So what has to be modified to run it with
>> Mahout?
>>
>> Thanks
>> Stefan
>>
>> 2011/6/7 Dmitriy Lyubimov <dlie...@gmail.com>:
>>> I also do LSI/LSA and wrote a stochastic SVD that is capable of
>>> producing exact U, V and Sigma,
>>> or UxSigma^-0.5 or VxSigma^-0.5, whatever you require.
>>>
>>> On top of that, it is adjusted to be LSI-friendly if your keys are not
>>> necessarily DoubleWritable (you can still keep your original document
>>> paths or whatever they are identified with); that info is carried over as
>>> the keys of the output U matrix. So you can run it directly on the
>>> output of Mahout's seq2sparse (I actually tested that on the Reuters
>>> dataset example from Mahout in Action).
>>>
>>> The computation has stochastic noise to it, but LSI problems are
>>> never exact problems anyway and their results are subject to corpus
>>> selection.
>>>
>>> If you are interested in helping me work the kinks out of Mahout's
>>> version, I'd be grateful, since I don't run the method on 0.20.2 but on
>>> CDH in a customized way. But be warned that it may require a number of
>>> patches before it works for you.
>>>
>>> Here is a (somewhat wordy) command line manual for Mahout 0.5:
>>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html
>>>
>>> Thanks.
>>>
>>> -D
>>>
>>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>> Yeah :) That's what I assumed. So this problem is solved. Thanks!
>>>>
>>>> And now your questions:
>>>>
>>>> I want to do LSA/LSI with Mahout. Therefore I take the TF-IDF
>>>> document-term matrix and reduce it using Lanczos. So I get the
>>>> document-concept vectors. These vectors will be compared with cosine
>>>> similarity to find similar documents. I can also cluster these
>>>> documents with k-means...
>>>>
>>>> If you have any suggestions, feel free to tell me. I'm also interested
>>>> in other document-similarity techniques.
>>>>
>>>> Cheers,
>>>> Stefan
>>>>
>>>> 2011/6/6 Danny Bickson <danny.bick...@gmail.com>:
>>>>> If I understand your question correctly, you simply need to transpose M
>>>>> before you start the run, and that way
>>>>> you will get the other singular vectors.
>>>>>
>>>>> May I ask what problem you are working on and why you need the
>>>>> singular vectors?
>>>>> Could you consider using another matrix decomposition technique, for example
>>>>> alternating least squares,
>>>>> which gives you two lower-rank matrices that approximate the large
>>>>> decomposed matrix?
>>>>>
>>>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>>>
>>>>>> Hi Danny!
>>>>>>
>>>>>> I understand that for M*M' (and for M'*M) the left and right
>>>>>> eigenvectors are identical. But that is not exactly what I want. The
>>>>>> Lanczos solver from Mahout gives me the eigenvectors of M*M', which
>>>>>> are the left singular vectors of M. But I need the right singular
>>>>>> vectors of M (and not M*M'). How do I get them?
>>>>>>
>>>>>> Sorry, my matrix math is not as good as it should be, but I hope you
>>>>>> can help me!
>>>>>>
>>>>>> Thanks,
>>>>>> Stefan
>>>>>>
>>>>>> 2011/6/6 Danny Bickson <danny.bick...@gmail.com>:
>>>>>> > Hi Stefan!
>>>>>> > For a positive semidefinite matrix, the left and right eigenvectors are
>>>>>> > identical.
>>>>>> > See the Wikipedia SVD text: When M is also positive
>>>>>> > semi-definite <http://en.wikipedia.org/wiki/Positive-definite_matrix>,
>>>>>> > the decomposition M = U D U* is also a singular value decomposition.
>>>>>> > So you don't need to be worried about the other singular vectors.
>>>>>> >
>>>>>> > Hope this helps!
>>>>>> >
>>>>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>>>> >
>>>>>> >> Hi.
>>>>>> >>
>>>>>> >> Thanks for the help.
>>>>>> >>
>>>>>> >> The important points from Wikipedia are:
>>>>>> >> - The left singular vectors of M are eigenvectors of M*M'.
>>>>>> >> - The right singular vectors of M are eigenvectors of M'*M.
>>>>>> >>
>>>>>> >> As you describe, the Mahout Lanczos solver calculates A=M'*M (I think
>>>>>> >> it does A=M*M', but that is not a problem). Therefore it already
>>>>>> >> calculates the right (or left) singular vectors of M.
>>>>>> >>
>>>>>> >> But my question is, how can I get the other singular vectors? I can
>>>>>> >> transpose M, but then I have to calculate two SVDs, one for the right
>>>>>> >> and one for the left singular vectors... I think there is a better way
>>>>>> >> :)
>>>>>> >>
>>>>>> >> Hope you can help me with this...
>>>>>> >> Thanks
>>>>>> >> Stefan
>>>>>> >>
>>>>>> >> 2011/6/6 Danny Bickson <danny.bick...@gmail.com>:
>>>>>> >> > Hi
>>>>>> >> > The Mahout SVD implementation computes the Lanczos iteration:
>>>>>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>>>>>> >> > Denote the non-square input matrix as M.
>>>>>> >> > First, a symmetric matrix A is computed by A=M'*M.
>>>>>> >> > Then an approximating tridiagonal matrix T and a vector matrix V are
>>>>>> >> > computed such that A =~ V*T*V'
>>>>>> >> > (this process is done in a distributed way).
>>>>>> >> >
>>>>>> >> > Next, the matrix T is decomposed into eigenvectors and eigenvalues,
>>>>>> >> > which is the returned result. (This process
>>>>>> >> > is serial.)
>>>>>> >> >
>>>>>> >> > The third step makes the returned eigenvectors orthogonal to each other
>>>>>> >> > (which is optional IMHO).
>>>>>> >> >
>>>>>> >> > The heart of the code is found at:
>>>>>> >> > ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>>>>>> >> > At least that is where it was in version 0.4; I am not sure if there are
>>>>>> >> > changes in version 0.5.
>>>>>> >> >
>>>>>> >> > Anyway, Mahout does not compute the SVD directly. If you are interested
>>>>>> >> > in learning more about the relation to SVD,
>>>>>> >> > look at: http://en.wikipedia.org/wiki/Singular_value_decomposition,
>>>>>> >> > subsection: relation to eigenvalue decomposition.
>>>>>> >> >
>>>>>> >> > Hope this helps,
>>>>>> >> >
>>>>>> >> > Danny Bickson
>>>>>> >> >
>>>>>> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>>>> >> >
>>>>>> >> >> After reading this thread:
>>>>>> >> >>
>>>>>> >> >> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3caanlktinq5k4xrm7nabwn8qobxzgvobbot2rtjzsv4...@mail.gmail.com%3E
>>>>>> >> >>
>>>>>> >> >> Wiki-SVD: M = U S V* (* = transposed)
>>>>>> >> >>
>>>>>> >> >> The output of Mahout-SVD is (U S), right?
>>>>>> >> >>
>>>>>> >> >> So... How do I get V from (U S) and M?
>>>>>> >> >>
>>>>>> >> >> Is V = M (U S)* (because this is what the calculation in the
>>>>>> >> >> example is)?
>>>>>> >> >>
>>>>>> >> >> Thanks
>>>>>> >> >> Stefan
>>>>>> >> >>
>>>>>> >> >> 2011/6/6 Stefan Wienert <ste...@wienert.cc>:
>>>>>> >> >> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>> >> >> >
>>>>>> >> >> > What is done:
>>>>>> >> >> >
>>>>>> >> >> > Input:
>>>>>> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
>>>>>> >> >> >
>>>>>> >> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvectors (and
>>>>>> >> >> > eigenvalues) of tf-idf-matrix, called:
>>>>>> >> >> > svd (concepts x terms) 87 x 20444
>>>>>> >> >> >
>>>>>> >> >> > transpose tf-idf-matrix:
>>>>>> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>>>>>> >> >> >
>>>>>> >> >> > transpose svd:
>>>>>> >> >> > svd-transpose (terms x concepts) 20444 x 87
>>>>>> >> >> >
>>>>>> >> >> > matrix multiply:
>>>>>> >> >> > tf-idf-matrix-transpose x svd-transpose = result
>>>>>> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
>>>>>> >> >> >
>>>>>> >> >> > So... I do understand that the "svd" here is not the SVD from Wikipedia.
>>>>>> >> >> > It only does the Lanczos algorithm and some magic which produces the
>>>>>> >> >> >> Instead either the left or right (but usually the right) eigenvectors
>>>>>> >> >> >> premultiplied by the diagonal or the square root of the
>>>>>> >> >> >> diagonal element.
>>>>>> >> >> > from
>>>>>> >> >> > http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=rta7tfrm8zi60vcfya5xf+dbfrj8pcds2n...@mail.gmail.com%3E
>>>>>> >> >> >
>>>>>> >> >> > So my question: what is the output of the SVD in Mahout? And what do I
>>>>>> >> >> > have to calculate to get the "right singular value" from svd?
>>>>>> >> >> >
>>>>>> >> >> > Thanks,
>>>>>> >> >> > Stefan
>>>>>> >> >> >
>>>>>> >> >> > 2011/6/6 Stefan Wienert <ste...@wienert.cc>:
>>>>>> >> >> >> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>> >> >> >>
>>>>>> >> >> >> The last step is the matrix multiplication:
>>>>>> >> >> >> --arg --numRowsA --arg 20444 \
>>>>>> >> >> >> --arg --numColsA --arg 6076937 \
>>>>>> >> >> >> --arg --numRowsB --arg 20444 \
>>>>>> >> >> >> --arg --numColsB --arg 87 \
>>>>>> >> >> >> so the result is a 6,076,937 x 87 matrix.
>>>>>> >> >> >>
>>>>>> >> >> >> The input has 6,076,937 documents (each with 20,444 terms). So, judging
>>>>>> >> >> >> by the dimensions, the result of the matrix multiplication has to be
>>>>>> >> >> >> the right singular vectors.
>>>>>> >> >> >>
>>>>>> >> >> >> So the result is the "concept-document vector matrix" (I think
>>>>>> >> >> >> these are also called "document vectors"?)
>>>>>> >> >> >>
>>>>>> >> >> >> 2011/6/6 Ted Dunning <ted.dunn...@gmail.com>:
>>>>>> >> >> >>> Yes. These are term vectors, not document vectors.
>>>>>> >> >> >>>
>>>>>> >> >> >>> There is an additional step that can be run to produce document vectors.
>>>>>> >> >> >>>
>>>>>> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>>>> >> >> >>>
>>>>>> >> >> >>>> Compared to SVD, is the result the "right singular value"?
--
Stefan Wienert

http://www.wienert.cc
ste...@wienert.cc

Phone: +495251-2026838
Mobile: +49176-40170270
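
A note on the question that runs through this thread (how to get the other set of singular vectors without running a second decomposition). If v is a unit eigenvector of M'*M with eigenvalue lambda > 0, then sigma = sqrt(lambda) is a singular value of M, v is the matching right singular vector, and u = M*v / sigma is the matching left singular vector. So a single multiplication by M recovers the other side, which appears to be what the multiplication step quoted above is for. The sketch below checks this on a tiny matrix in plain Python. It is not Mahout code: power iteration is swapped in for Lanczos to keep it short, it only finds the largest singular triplet, and the helper names (matvec, transpose, top_singular_triplet) are made up for this illustration.

```python
import math


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def transpose(M):
    """Return the transpose of M as a list of rows."""
    return [list(col) for col in zip(*M)]


def top_singular_triplet(M, iterations=200):
    """Return (u, sigma, v) for the largest singular value of M.

    v is found by power iteration on A = M' * M, so it is an eigenvector
    of A and therefore a right singular vector of M. The left singular
    vector then comes from one extra product: u = M v / sigma.
    """
    Mt = transpose(M)
    # Arbitrary non-zero start vector (assumed not orthogonal to the answer).
    v = [1.0 / (i + 1) for i in range(len(M[0]))]
    for _ in range(iterations):
        w = matvec(Mt, matvec(M, v))  # w = (M' M) v, without forming M' M
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Mv = matvec(M, v)
    sigma = math.sqrt(sum(x * x for x in Mv))  # ||M v|| = sigma for unit v
    u = [x / sigma for x in Mv]
    return u, sigma, v


# Toy "docs x terms" matrix: 3 documents, 2 terms.
M = [[1.0, 1.0],
     [0.0, 1.0],
     [1.0, 0.0]]

u, sigma, v = top_singular_triplet(M)
print("sigma:", sigma)
print("right singular vector (term space):", v)
print("left singular vector (document space):", u)
```

With documents as rows, v lives in term space and u in document space, so M*v (which equals sigma*u) gives one coordinate per document for that concept. Stacking several such v as columns and multiplying M by them would give a docs x concepts matrix in the same way.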