One thing to watch out for when clustering SVD output is eigenspokes: www.cs.cmu.edu/~badityap/papers/eigenspokes-pakdd10.pdf

On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <ste...@wienert.cc> wrote:
> Yeah :) That's what I assumed. So this problem is solved. Thanks!
>
> And now your questions:
>
> I want to do LSA/LSI with Mahout. Therefore I take the TF-IDF
> document-term matrix and reduce it using Lanczos. So I get the
> document-concept vectors. These vectors will be compared with cosine
> similarity to find similar documents. I can also cluster these
> documents with k-means...
>
> If you have any suggestions, feel free to tell me. I'm also interested
> in other document-similarity techniques.
>
> Cheers,
> Stefan
>
> 2011/6/6 Danny Bickson <danny.bick...@gmail.com>:
> > If I understand your question correctly, you simply need to transpose M
> > before you start the run, and that way you will get the other singular
> > vectors.
> >
> > May I ask what problem you are working on and why you need the
> > singular vectors?
> > Could you consider another matrix decomposition technique, for example
> > alternating least squares, which gives you two lower-rank matrices
> > whose product approximates the large matrix?
> >
> > On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <ste...@wienert.cc> wrote:
> >
> >> Hi Danny!
> >>
> >> I understand that for M*M' (and for M'*M) the left and right
> >> eigenvectors are identical. But that is not exactly what I want. The
> >> Lanczos solver from Mahout gives me the eigenvectors of M*M', which
> >> are the left singular vectors of M. But I need the right singular
> >> vectors of M (and not M*M'). How do I get them?
> >>
> >> Sorry, my matrix math is not as good as it should be, but I hope you
> >> can help me!
> >>
> >> Thanks,
> >> Stefan
> >>
> >> 2011/6/6 Danny Bickson <danny.bick...@gmail.com>:
> >> > Hi Stefan!
> >> > For a positive semidefinite matrix, the left and right eigenvectors are
> >> > identical.
> >> > See the Wikipedia SVD text: When M is also positive
> >> > semi-definite <http://en.wikipedia.org/wiki/Positive-definite_matrix>,
> >> > the decomposition M = U D U* is also a singular value decomposition.
> >> > So you don't need to be worried about the other singular vectors.
> >> >
> >> > Hope this helps!
> >> >
> >> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <ste...@wienert.cc> wrote:
> >> >
> >> >> Hi.
> >> >>
> >> >> Thanks for the help.
> >> >>
> >> >> The important points from Wikipedia are:
> >> >> - The left singular vectors of M are eigenvectors of M*M'.
> >> >> - The right singular vectors of M are eigenvectors of M'*M.
> >> >>
> >> >> As you describe, the Mahout Lanczos solver calculates A=M'*M (I think
> >> >> it does A=M*M', but that is not a problem). Therefore it already
> >> >> calculates the right (or left) singular vectors of M.
> >> >>
> >> >> But my question is, how can I get the other singular vectors? I can
> >> >> transpose M, but then I have to calculate two SVDs, one for the right
> >> >> and one for the left singular vectors... I think there is a better way
> >> >> :)
> >> >>
> >> >> Hope you can help me with this...
> >> >> Thanks
> >> >> Stefan
> >> >>
> >> >>
> >> >> 2011/6/6 Danny Bickson <danny.bick...@gmail.com>:
> >> >> > Hi
> >> >> > The Mahout SVD implementation computes the Lanczos iteration:
> >> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
> >> >> > Denote the non-square input matrix as M. First, a symmetric matrix A is
> >> >> > computed as A=M'*M.
> >> >> > Then an approximating tridiagonal matrix T and a vector matrix V are
> >> >> > computed such that A =~ V*T*V'
> >> >> > (this process is done in a distributed way).
> >> >> >
> >> >> > Next, the matrix T is decomposed into eigenvectors and eigenvalues,
> >> >> > which are the returned result. (This process is serial.)
> >> >> >
> >> >> > The third step makes the returned eigenvectors orthogonal to each other
> >> >> > (which is optional IMHO).
> >> >> >
> >> >> > The heart of the code is found at:
> >> >> > ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
> >> >> > At least that is where it was in version 0.4; I am not sure if there are
> >> >> > changes in version 0.5.
> >> >> >
> >> >> > Anyway, Mahout does not compute the SVD directly. If you are interested in
> >> >> > learning more about the relation to SVD, look at:
> >> >> > http://en.wikipedia.org/wiki/Singular_value_decomposition
> >> >> > (subsection: relation to eigenvalue decomposition).
> >> >> >
> >> >> > Hope this helps,
> >> >> >
> >> >> > Danny Bickson
> >> >> >
> >> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <ste...@wienert.cc> wrote:
> >> >> >
> >> >> >> After reading this thread:
> >> >> >>
> >> >> >> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3caanlktinq5k4xrm7nabwn8qobxzgvobbot2rtjzsv4...@mail.gmail.com%3E
> >> >> >>
> >> >> >> Wiki-SVD: M = U S V* (* = transposed)
> >> >> >>
> >> >> >> The output of Mahout-SVD is (U S), right?
> >> >> >>
> >> >> >> So... How do I get V from (U S) and M?
> >> >> >>
> >> >> >> Is V = M (U S)* (because this is what the calculation in the example
> >> >> >> is)?
> >> >> >>
> >> >> >> Thanks
> >> >> >> Stefan
> >> >> >>
> >> >> >> 2011/6/6 Stefan Wienert <ste...@wienert.cc>:
> >> >> >> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> >> >> >> >
> >> >> >> > What is done:
> >> >> >> >
> >> >> >> > Input:
> >> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
> >> >> >> >
> >> >> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvectors (and
> >> >> >> > eigenvalues) of tf-idf-matrix, called:
> >> >> >> > svd (concepts x terms) 87 x 20444
> >> >> >> >
> >> >> >> > transpose tf-idf-matrix:
> >> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
> >> >> >> >
> >> >> >> > transpose svd:
> >> >> >> > svd-transpose (terms x concepts) 20444 x 87
> >> >> >> >
> >> >> >> > matrix multiply:
> >> >> >> > tf-idf-matrix-transpose x svd-transpose = result
> >> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
> >> >> >> >
> >> >> >> > So... I do understand that the "svd" here is not the SVD from Wikipedia.
> >> >> >> > It only does the Lanczos algorithm and some magic which produces the
> >> >> >> >> Instead either the left or right (but usually the right) eigenvectors
> >> >> >> >> premultiplied by the diagonal or the square root of the
> >> >> >> >> diagonal element.
> >> >> >> > from
> >> >> >> > http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=rta7tfrm8zi60vcfya5xf+dbfrj8pcds2n...@mail.gmail.com%3E
> >> >> >> >
> >> >> >> > So my question: what is the output of the SVD in Mahout? And what do I
> >> >> >> > have to calculate to get the "right singular value" from svd?
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Stefan
> >> >> >> >
> >> >> >> > 2011/6/6 Stefan Wienert <ste...@wienert.cc>:
> >> >> >> >> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
> >> >> >> >>
> >> >> >> >> The last step is the matrix multiplication:
> >> >> >> >> --arg --numRowsA --arg 20444 \
> >> >> >> >> --arg --numColsA --arg 6076937 \
> >> >> >> >> --arg --numRowsB --arg 20444 \
> >> >> >> >> --arg --numColsB --arg 87 \
> >> >> >> >> so the result is a 6,076,937 x 87 matrix.
> >> >> >> >>
> >> >> >> >> The input has 6,076,937 documents (each with 20,444 terms), so the
> >> >> >> >> result of the matrix multiplication has to be the right singular value,
> >> >> >> >> judging by the dimensions.
> >> >> >> >>
> >> >> >> >> So the result is the "concept-document vector matrix" (I think
> >> >> >> >> these are also called "document vectors"?)
> >> >> >> >>
> >> >> >> >> 2011/6/6 Ted Dunning <ted.dunn...@gmail.com>:
> >> >> >> >>> Yes. These are term vectors, not document vectors.
> >> >> >> >>>
> >> >> >> >>> There is an additional step that can be run to produce document
> >> >> >> >>> vectors.
> >> >> >> >>>
> >> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <ste...@wienert.cc> wrote:
> >> >> >> >>>
> >> >> >> >>>> compared to SVD, is the result the "right singular value"?
> >> >> >> >>
> >> >> >> >> --
> >> >> >> >> Stefan Wienert
> >> >> >> >>
> >> >> >> >> http://www.wienert.cc
> >> >> >> >> ste...@wienert.cc
> >> >> >> >>
> >> >> >> >> Telefon: +495251-2026838
> >> >> >> >> Mobil: +49176-40170270
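
For anyone reading the quoted thread later: the recurring question (how to get the "other" singular vectors without running a second decomposition) can be sketched on a toy matrix. This is only an illustration, not Mahout code. It assumes the solver has returned an eigenvector v of A=M'*M with eigenvalue lambda, as described in the thread; then sigma = sqrt(lambda) and the matching left singular vector is u = M*v / sigma. The 3 x 2 matrix, the helper names, and the use of plain power iteration as a stand-in for Lanczos are all assumptions made for the example.

```python
import math

def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def norm(x):
    return math.sqrt(sum(c * c for c in x))

def top_eigenpair(A, iters=200):
    """Power iteration on a symmetric PSD matrix.
    Stand-in for Lanczos; returns (eigenvalue, unit eigenvector)."""
    x = [1.0 / (i + 1) for i in range(len(A))]  # arbitrary nonzero start
    for _ in range(iters):
        y = matvec(A, x)
        n = norm(y)
        x = [c / n for c in y]
    lam = sum(a * b for a, b in zip(x, matvec(A, x)))  # Rayleigh quotient
    return lam, x

# Toy "docs x terms" matrix (3 documents, 2 terms); values are made up.
M = [[3.0, 1.0],
     [1.0, 3.0],
     [1.0, 1.0]]
Mt = transpose(M)

# A = M' * M (terms x terms), the symmetric matrix the thread describes.
A = [[sum(a * b for a, b in zip(ri, rj)) for rj in Mt] for ri in Mt]

lam, v = top_eigenpair(A)   # v: right singular vector of M (a term vector)
sigma = math.sqrt(lam)      # matching singular value

doc_vec = matvec(M, v)              # M * v = sigma * u (document side)
u = [c / sigma for c in doc_vec]    # left singular vector of M

# Consistency check: M' * u should equal sigma * v.
back = matvec(Mt, u)
print("sigma =", sigma)
print("u =", u)
print("M'u =", back, " sigma*v =", [sigma * c for c in v])
```

The same relation holds column by column for a rank-k result: multiplying M by the term-side vectors gives U*S, which is what the matrix-multiply step discussed in the thread produces, and dividing each column by its singular value gives U.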