Before I rewrite my program: does the SSVD have any advantage over the Lanczos SVD?

2011/6/7 Dmitriy Lyubimov <dlie...@gmail.com>:
> I am saying I did not test it with 0.20.2.
>
> Yes, it is integrated in the 0.5 release, but there might be problems with
> Hadoop 0.20.2.
>
> On Tue, Jun 7, 2011 at 12:55 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>> Hmm... looks nice...
>>
>> So there is a Lanczos implementation of SVD and the stochastic version
>> of SVD. Both produce the doc-concept vectors that I need.
>> I actually get my TF-IDF vectors directly from a Lucene index (and have
>> to do some magic to get IntWritable, VectorWritable).
>>
>> Still, I do not exactly understand what you are trying to say. Your SSVD
>> does not run with Mahout but on CDH (what is that, btw?)? Or is it
>> available for Mahout? So what has to be modified to run it with
>> Mahout?
>>
>> Thanks
>> Stefan
>>
>>
>>
>> 2011/6/7 Dmitriy Lyubimov <dlie...@gmail.com>:
>>> I also do LSI/LSA and wrote a stochastic SVD that is capable of producing
>>> exact U, V and Sigma,
>>> or U x Sigma^-0.5 or V x Sigma^-0.5, whatever you require.
>>>
>>> On top of that, it is adjusted to be LSI-friendly even if your keys are
>>> not necessarily DoubleWritable (you can still keep your original document
>>> paths or whatever they are identified with); that info is carried over as
>>> the keys of the U matrix output. So you can run it directly on the output
>>> of Mahout's seq2sparse (I actually tested that on the Reuters dataset
>>> example from Mahout in Action).
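>>>
>>> (In case it is useful, here is a rough, untested sketch of how you could
>>> peek at the U output and check that your original keys survive. The part
>>> file path is just a placeholder, and I'm assuming the rows are stored as
>>> VectorWritable values like most Mahout matrix outputs:)
>>>
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.fs.FileSystem;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.io.SequenceFile;
>>> import org.apache.hadoop.io.Writable;
>>> import org.apache.hadoop.util.ReflectionUtils;
>>> import org.apache.mahout.math.VectorWritable;
>>>
>>> public class InspectUKeys {
>>>   public static void main(String[] args) throws Exception {
>>>     Configuration conf = new Configuration();
>>>     // e.g. one part file of the U output (placeholder path)
>>>     Path part = new Path(args[0]);
>>>     FileSystem fs = part.getFileSystem(conf);
>>>     SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
>>>     try {
>>>       // key class is whatever seq2sparse wrote (e.g. Text document paths)
>>>       Writable key =
>>>           (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
>>>       VectorWritable row = new VectorWritable();
>>>       while (reader.next(key, row)) {
>>>         System.out.println(key + " -> " + row.get().size() + " concept weights");
>>>       }
>>>     } finally {
>>>       reader.close();
>>>     }
>>>   }
>>> }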
>>>
>>> The computation has some stochastic noise to it, but LSI problems are
>>> never exact problems anyway, and their results are subject to corpus
>>> selection.
>>>
>>> If you are interested in helping me work the kinks out of the Mahout
>>> version, I'd be grateful, since I don't run the method on 0.20.2 but on
>>> CDH in a customized way. But be warned that it may require a number of
>>> patches before it works for you.
>>>
>>> Here is a (little bit too wordy) command-line manual for Mahout 0.5:
>>> http://weatheringthrutechdays.blogspot.com/2011/03/ssvd-command-line-usage.html
>>>
>>> Thanks.
>>>
>>> -D
>>>
>>>
>>> On Mon, Jun 6, 2011 at 12:08 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>> Yeah :) That's what I assumed. So this problem is solved. Thanks!
>>>>
>>>> And now your questions:
>>>>
>>>> I want to do LSA/LSI with Mahout. Therefore I take the TF-IDF
>>>> document-term matrix and reduce it using Lanczos. That gives me the
>>>> document-concept vectors. These vectors are then compared with cosine
>>>> similarity to find similar documents (toy sketch below). I can also
>>>> cluster these documents with k-means...
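>>>>
>>>> For the comparison step, this is roughly what I have in mind (just a toy
>>>> sketch with made-up numbers, using the dot() and norm() calls from the
>>>> Mahout math Vector API):
>>>>
>>>> import org.apache.mahout.math.DenseVector;
>>>> import org.apache.mahout.math.Vector;
>>>>
>>>> public class CosineDemo {
>>>>   // cosine similarity between two document-concept vectors
>>>>   static double cosine(Vector a, Vector b) {
>>>>     double denom = a.norm(2) * b.norm(2);
>>>>     return denom == 0.0 ? 0.0 : a.dot(b) / denom;
>>>>   }
>>>>
>>>>   public static void main(String[] args) {
>>>>     Vector doc1 = new DenseVector(new double[] {0.1, 0.7, 0.2});
>>>>     Vector doc2 = new DenseVector(new double[] {0.2, 0.6, 0.1});
>>>>     System.out.println("similarity = " + cosine(doc1, doc2));
>>>>   }
>>>> }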
>>>>
>>>> If you have any suggestions, feel free to tell me. I'm also interested
>>>> in other document-similarity techniques.
>>>>
>>>> Cheers,
>>>> Stefan
>>>>
>>>> 2011/6/6 Danny Bickson <danny.bick...@gmail.com>:
>>>>> If I understand your question correctly, you simply need to transpose M
>>>>> before you start the run, and that way you will get the other singular
>>>>> vectors.
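>>>>>
>>>>> (The reason this works: if M = U S V' (' = transposed), then
>>>>> M' = V S U', so the left singular vectors of M' are exactly the right
>>>>> singular vectors of M, and the singular values are unchanged.)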
>>>>>
>>>>> May I ask what problem you are working on and why you need the
>>>>> singular vectors?
>>>>> Could you consider using another matrix decomposition technique, for
>>>>> example alternating least squares, which gives you two lower-rank
>>>>> matrices that approximate the large decomposed matrix?
>>>>>
>>>>> On Mon, Jun 6, 2011 at 1:30 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>>>
>>>>>> Hi Danny!
>>>>>>
>>>>>> I understand that for M*M' (and for M'*M) the left and right
>>>>>> eigenvectors are identical. But that is not exactly what I want. The
>>>>>> Lanczos solver from Mahout gives me the eigenvectors of M*M', which
>>>>>> are the left singular vectors of M. But I need the right singular
>>>>>> vectors of M (and not M*M'). How do I get them?
>>>>>>
>>>>>> Sorry, my matrix math is not as good as it should be, but I hope you
>>>>>> can help me!
>>>>>>
>>>>>> Thanks,
>>>>>> Stefan
>>>>>>
>>>>>> 2011/6/6 Danny Bickson <danny.bick...@gmail.com>:
>>>>>> > Hi Stefan!
>>>>>> > For a positive semidefinite matrix, the left and right eigenvectors are
>>>>>> > identical. See the SVD Wikipedia text: when M is also positive
>>>>>> > semi-definite (http://en.wikipedia.org/wiki/Positive-definite_matrix),
>>>>>> > the decomposition M = U D U* is also a singular value decomposition.
>>>>>> > So you don't need to be worried about the other singular vectors.
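>>>>>> >
>>>>>> > (For example, A = M'*M is symmetric positive semi-definite, and its
>>>>>> > eigendecomposition A = Q L Q' already is an SVD of A: the left and
>>>>>> > right singular vectors are both Q, and the singular values are the
>>>>>> > eigenvalues.)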
>>>>>> >
>>>>>> > Hope this helps!
>>>>>> >
>>>>>> > On Mon, Jun 6, 2011 at 12:57 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>>>> >
>>>>>> >> Hi.
>>>>>> >>
>>>>>> >> Thanks for the help.
>>>>>> >>
>>>>>> >> The important points from Wikipedia are:
>>>>>> >> - The left singular vectors of M are eigenvectors of M*M' .
>>>>>> >> - The right singular vectors of M are eigenvectors of M'*M.
>>>>>> >>
>>>>>> >> As you describe, the Mahout Lanczos solver calculates A = M'*M (I think
>>>>>> >> it actually does A = M*M', but that is not a problem). Therefore it
>>>>>> >> already calculates the right (or left) singular vectors of M.
>>>>>> >>
>>>>>> >> But my question is: how can I get the other singular vectors? I could
>>>>>> >> transpose M, but then I would have to calculate two SVDs, one for the
>>>>>> >> right and one for the left singular vectors... I think there is a
>>>>>> >> better way :)
>>>>>> >>
>>>>>> >> Hope you can help me with this...
>>>>>> >> Thanks
>>>>>> >> Stefan
>>>>>> >>
>>>>>> >>
>>>>>> >> 2011/6/6 Danny Bickson <danny.bick...@gmail.com>:
>>>>>> >> > Hi
>>>>>> >> > Mahout's SVD implementation computes the Lanczos iteration:
>>>>>> >> > http://en.wikipedia.org/wiki/Lanczos_algorithm
>>>>>> >> > Denote the non-square input matrix as M. First a symmetric matrix A
>>>>>> >> > is computed as A = M'*M.
>>>>>> >> > Then an approximating tridiagonal matrix T and a vector matrix V are
>>>>>> >> > computed such that A =~ V*T*V'
>>>>>> >> > (this process is done in a distributed way).
>>>>>> >> >
>>>>>> >> > Next, the matrix T is decomposed into eigenvectors and eigenvalues,
>>>>>> >> > which is the returned result (this process is serial).
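>>>>>> >> >
>>>>>> >> > (Concretely: if T y = lambda y, then lambda is an approximate
>>>>>> >> > eigenvalue of A and V*y is the corresponding approximate eigenvector,
>>>>>> >> > so the small serial eigenproblem on T gives back eigenpairs of the
>>>>>> >> > big distributed matrix A.)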
>>>>>> >> >
>>>>>> >> > The third step makes the returned eigenvectors orthogonal to each
>>>>>> >> > other (which is optional, IMHO).
>>>>>> >> >
>>>>>> >> > The heart of the code is found at:
>>>>>> >> > ./math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
>>>>>> >> > At least that is where it was in version 0.4; I am not sure if there
>>>>>> >> > are changes in version 0.5.
>>>>>> >> >
>>>>>> >> > Anyway, Mahout does not compute the SVD directly. If you are
>>>>>> >> > interested in learning more about the relation to SVD, look at
>>>>>> >> > http://en.wikipedia.org/wiki/Singular_value_decomposition,
>>>>>> >> > subsection "Relation to eigenvalue decomposition".
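>>>>>> >> >
>>>>>> >> > In short: if M = U S V' is the SVD of M, then M'*M = V S^2 V' and
>>>>>> >> > M*M' = U S^2 U', so the eigenvectors of M'*M are the right singular
>>>>>> >> > vectors of M, the eigenvectors of M*M' are the left ones, and the
>>>>>> >> > eigenvalues are the squared singular values.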
>>>>>> >> >
>>>>>> >> > Hope this helps,
>>>>>> >> >
>>>>>> >> > Danny Bickson
>>>>>> >> >
>>>>>> >> > On Mon, Jun 6, 2011 at 9:35 AM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>>>> >> >
>>>>>> >> >> After reading this thread:
>>>>>> >> >>
>>>>>> >> >> http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3caanlktinq5k4xrm7nabwn8qobxzgvobbot2rtjzsv4...@mail.gmail.com%3E
>>>>>> >> >>
>>>>>> >> >> Wiki-SVD: M = U S V* (* = transposed)
>>>>>> >> >>
>>>>>> >> >> The output of Mahout-SVD is (U S) right?
>>>>>> >> >>
>>>>>> >> >> So... How do I get V from (U S)  and M?
>>>>>> >> >>
>>>>>> >> >> Is V = M (U S)* (because this is what the calculation in the
>>>>>> >> >> example does)?
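>>>>>> >> >>
>>>>>> >> >> (Trying the algebra myself: if M = U S V', then
>>>>>> >> >> M' (U S) = V S U' U S = V S^2, so M-transposed times (U S) would
>>>>>> >> >> give V up to rescaling each column by 1/sigma^2... is that right?)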
>>>>>> >> >>
>>>>>> >> >> Thanks
>>>>>> >> >> Stefan
>>>>>> >> >>
>>>>>> >> >> 2011/6/6 Stefan Wienert <ste...@wienert.cc>:
>>>>>> >> >> > https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>> >> >> >
>>>>>> >> >> > What is done:
>>>>>> >> >> >
>>>>>> >> >> > Input:
>>>>>> >> >> > tf-idf-matrix (docs x terms) 6076937 x 20444
>>>>>> >> >> >
>>>>>> >> >> > "SVD" of tf-idf-matrix (rank 100) produces the eigenvector (and
>>>>>> >> >> > eigenvalues) of tf-idf-matrix, called:
>>>>>> >> >> > svd (concepts x terms) 87 x 20444
>>>>>> >> >> >
>>>>>> >> >> > transpose tf-idf-matrix:
>>>>>> >> >> > tf-idf-matrix-transpose (terms x docs) 20444 x 6076937
>>>>>> >> >> >
>>>>>> >> >> > transpose svd:
>>>>>> >> >> > svd-transpose (terms x concepts) 20444 x 87
>>>>>> >> >> >
>>>>>> >> >> > matrix multiply:
>>>>>> >> >> > tf-idf-matrix-transpose x svd-transpose = result
>>>>>> >> >> > (terms x docs) x (terms x concepts) = (docs x concepts)
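>>>>>> >> >> >
>>>>>> >> >> > (quick dimension check: (terms x docs) x (terms x concepts) does
>>>>>> >> >> > not conform as written; what fits the sizes is
>>>>>> >> >> > transpose(terms x docs) x (terms x concepts)
>>>>>> >> >> > = (docs x terms) x (terms x concepts) = (docs x concepts),
>>>>>> >> >> > i.e. 6076937 x 87, so the multiply job effectively computes A' * B)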
>>>>>> >> >> >
>>>>>> >> >> > so... I do understand that the "svd" here is not the SVD from
>>>>>> >> >> > Wikipedia. It only does the Lanczos algorithm and some magic which
>>>>>> >> >> > produces the
>>>>>> >> >> >> Instead either the left or right (but usually the right)
>>>>>> >> >> >> eigenvectors premultiplied by the diagonal or the square root of
>>>>>> >> >> >> the diagonal element.
>>>>>> >> >> > from
>>>>>> >> >> > http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%3CAANLkTi=rta7tfrm8zi60vcfya5xf+dbfrj8pcds2n...@mail.gmail.com%3E
>>>>>> >> >> >
>>>>>> >> >> > so my question: what is the output of the SVD in Mahout? And what
>>>>>> >> >> > do I have to calculate to get the "right singular vectors" from
>>>>>> >> >> > svd?
>>>>>> >> >> >
>>>>>> >> >> > Thanks,
>>>>>> >> >> > Stefan
>>>>>> >> >> >
>>>>>> >> >> > 2011/6/6 Stefan Wienert <ste...@wienert.cc>:
>>>>>> >> >> >> https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
>>>>>> >> >> >>
>>>>>> >> >> >> the last step is the matrix multiplication:
>>>>>> >> >> >>  --arg --numRowsA --arg 20444 \
>>>>>> >> >> >>  --arg --numColsA --arg 6076937 \
>>>>>> >> >> >>  --arg --numRowsB --arg 20444 \
>>>>>> >> >> >>  --arg --numColsB --arg 87 \
>>>>>> >> >> >> so the result is a 6,076,937 x 87 matrix
>>>>>> >> >> >>
>>>>>> >> >> >> the input has 6,076,937 documents (each with 20,444 terms), so
>>>>>> >> >> >> judging by the dimensions, the result of the matrix multiplication
>>>>>> >> >> >> has to be the right singular vectors.
>>>>>> >> >> >>
>>>>>> >> >> >> so the result is the "concept-document vector matrix" (I think
>>>>>> >> >> >> this is also called the "document vectors"?)
>>>>>> >> >> >>
>>>>>> >> >> >> 2011/6/6 Ted Dunning <ted.dunn...@gmail.com>:
>>>>>> >> >> >>> Yes.  These are term vectors, not document vectors.
>>>>>> >> >> >>>
>>>>>> >> >> >>> There is an additional step that can be run to produce document
>>>>>> >> >> vectors.
>>>>>> >> >> >>>
>>>>>> >> >> >>> On Sun, Jun 5, 2011 at 1:16 PM, Stefan Wienert <ste...@wienert.cc> wrote:
>>>>>> >> >> >>>
>>>>>> >> >> >>>> Compared to SVD, is the result the "right singular vectors"?
>>>>>> >> >> >>>>
>>>>>> >> >> >>>
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >>
>>>>>> >> >> >> --
>>>>>> >> >> >> Stefan Wienert
>>>>>> >> >> >>
>>>>>> >> >> >> http://www.wienert.cc
>>>>>> >> >> >> ste...@wienert.cc
>>>>>> >> >> >>
>>>>>> >> >> >> Telefon: +495251-2026838
>>>>>> >> >> >> Mobil: +49176-40170270
>>>>>> >> >> >>
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> >
>>>>>> >> >> > --
>>>>>> >> >> > Stefan Wienert
>>>>>> >> >> >
>>>>>> >> >> > http://www.wienert.cc
>>>>>> >> >> > ste...@wienert.cc
>>>>>> >> >> >
>>>>>> >> >> > Telefon: +495251-2026838
>>>>>> >> >> > Mobil: +49176-40170270
>>>>>> >> >> >
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> --
>>>>>> >> >> Stefan Wienert
>>>>>> >> >>
>>>>>> >> >> http://www.wienert.cc
>>>>>> >> >> ste...@wienert.cc
>>>>>> >> >>
>>>>>> >> >> Telefon: +495251-2026838
>>>>>> >> >> Mobil: +49176-40170270
>>>>>> >> >>
>>>>>> >> >
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> --
>>>>>> >> Stefan Wienert
>>>>>> >>
>>>>>> >> http://www.wienert.cc
>>>>>> >> ste...@wienert.cc
>>>>>> >>
>>>>>> >> Telefon: +495251-2026838
>>>>>> >> Mobil: +49176-40170270
>>>>>> >>
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Stefan Wienert
>>>>>>
>>>>>> http://www.wienert.cc
>>>>>> ste...@wienert.cc
>>>>>>
>>>>>> Telefon: +495251-2026838
>>>>>> Mobil: +49176-40170270
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Stefan Wienert
>>>>
>>>> http://www.wienert.cc
>>>> ste...@wienert.cc
>>>>
>>>> Telefon: +495251-2026838
>>>> Mobil: +49176-40170270
>>>>
>>>
>>
>>
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> ste...@wienert.cc
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>>
>



-- 
Stefan Wienert

http://www.wienert.cc
ste...@wienert.cc

Telefon: +495251-2026838
Mobil: +49176-40170270
