I have been following this thread intermittently. Some folks have said that using a higher-dimensional SVD should change the distribution of distances.
Actually, that isn't quite true. SVD preserves dot products as well as possible. With lower-dimensional projections you lose some information, but as the singular values decline, you lose less and less of it. It *is*, however, true that *random* unit vectors in higher dimensions have dot products that are more and more tightly clustered around zero (a small self-contained sketch of this is appended below the quoted thread). That is an entirely different situation from the one we are talking about here, where real data is projected down into a lower-dimensional space.

On Wed, Jun 15, 2011 at 7:44 PM, Jake Mannix <[email protected]> wrote:

> On Wed, Jun 15, 2011 at 10:06 AM, Stefan Wienert <[email protected]> wrote:
>
> > Hmm. Seems I have plenty of negative results (nearly half of the
> > similarities). I can add +0.3 so that the greatest negative results
> > are near 0. This is not optimal...
> > I can project the results to [0..1].
>
> Looking for *dissimilar* results seems odd. What are you trying to do?
>
> What people normally do is look for clusters of similar documents, or
> just the top-N most similar documents to each document. In both of these
> cases, you don't care about the documents whose similarity to anyone is
> zero, or less than zero.
>
>   -jake
>
> > Any other suggestions or comments?
> >
> > Cheers
> > Stefan
> >
> > 2011/6/15 Jake Mannix <[email protected]>:
> > > While your original vectors never had similarity less than zero, after
> > > projection onto the SVD space, you may "project away" the similarity
> > > between two vectors, and they are now negatively correlated in this
> > > space (think about projecting (1,0,1) and (0,1,1) onto the 1-d vector
> > > space spanned by (1,-1,0) - they go from having similarity +1/2 to
> > > similarity -1).
> > >
> > > I always interpret all similarities <= 0 as "maximally dissimilar",
> > > even if technically -1 is where this is exactly true.
> > >
> > >   -jake
> > >
> > > On Wed, Jun 15, 2011 at 2:10 AM, Stefan Wienert <[email protected]> wrote:
> > >
> > >> Ignoring them is no option... so I have to interpret these values.
> > >> Can one say that documents with similarity = -1 are the least similar
> > >> documents? I don't think this is right.
> > >> Any other assumptions?
> > >>
> > >> 2011/6/15 Fernando Fernández <[email protected]>:
> > >> > One question that I think has not been answered yet is that of the
> > >> > negative similarities. In the literature you can find that
> > >> > similarity = -1 means that "documents talk about opposite topics",
> > >> > but I think this is a quite abstract idea... I just ignore them;
> > >> > when I'm trying to find the top-k similar documents these surely
> > >> > won't be useful. I read recently that this has to do with the
> > >> > assumptions in SVD, which is designed for normal distributions
> > >> > (this implies the possibility of negative values). There are other
> > >> > techniques (non-negative matrix factorization) that try to solve
> > >> > this. I don't know if there's something in Mahout about this.
> > >> >
> > >> > Best,
> > >> >
> > >> > Fernando.
> > >> >
> > >> > 2011/6/15 Ted Dunning <[email protected]>
> > >> >
> > >> >> The normal terminology is to name U and V in SVD "singular
> > >> >> vectors" as opposed to eigenvectors. The term eigenvectors is
> > >> >> normally reserved for the symmetric case of U S U' (more
> > >> >> generally, the Hermitian case, but we only support real values).
> > >> >>
> > >> >> On Wed, Jun 15, 2011 at 12:35 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > >> >>
> > >> >> > I beg to differ... U and V are left and right eigenvectors, and
> > >> >> > the singular values are denoted Sigma (which is the square root
> > >> >> > of the eigenvalues of AA', as you correctly pointed out).
> > >> >>
> > >>
> > >> --
> > >> Stefan Wienert
> > >>
> > >> http://www.wienert.cc
> > >> [email protected]
> > >>
> > >> Telefon: +495251-2026838
> > >> Mobil: +49176-40170270
> >
> > --
> > Stefan Wienert
> >
> > http://www.wienert.cc
> > [email protected]
> >
> > Telefon: +495251-2026838
> > Mobil: +49176-40170270
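
Here is the promised rough sketch of the random-vector point, in plain Java with no Mahout dependencies (the class and method names are just made up for illustration). It draws pairs of random unit vectors by normalizing Gaussian samples and prints the mean absolute dot product, which shrinks roughly like 1/sqrt(dim):

import java.util.Random;

public class RandomDotProducts {
  public static void main(String[] args) {
    Random rnd = new Random(42);
    int pairs = 10000;
    for (int dim : new int[] {2, 10, 100, 1000}) {
      double sumAbs = 0;
      for (int i = 0; i < pairs; i++) {
        double[] u = randomUnitVector(dim, rnd);
        double[] v = randomUnitVector(dim, rnd);
        sumAbs += Math.abs(dot(u, v));
      }
      System.out.printf("dim = %4d   mean |u.v| = %.4f%n", dim, sumAbs / pairs);
    }
  }

  // Gaussian components, then normalize: a uniformly random direction.
  static double[] randomUnitVector(int dim, Random rnd) {
    double[] x = new double[dim];
    double norm = 0;
    for (int i = 0; i < dim; i++) {
      x[i] = rnd.nextGaussian();
      norm += x[i] * x[i];
    }
    norm = Math.sqrt(norm);
    for (int i = 0; i < dim; i++) {
      x[i] /= norm;
    }
    return x;
  }

  static double dot(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      s += a[i] * b[i];
    }
    return s;
  }
}

Real document vectors projected through an SVD don't behave like this: the whole point of keeping the directions with the largest singular values is that the dot products carrying real structure are mostly preserved.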

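And a similarly hypothetical sketch of the projection example Jake gives above: it computes the cosine similarity of (1,0,1) and (0,1,1) before and after orthogonally projecting both onto the line spanned by (1,-1,0), printing +0.5 and then -1.0, which is exactly the sign flip being described:

public class ProjectionFlip {
  public static void main(String[] args) {
    double[] a = {1, 0, 1};
    double[] b = {0, 1, 1};
    double[] d = {1, -1, 0};   // direction spanning the 1-d subspace

    System.out.printf("cosine before projection: %+.4f%n", cosine(a, b));
    double[] pa = project(a, d);
    double[] pb = project(b, d);
    System.out.printf("cosine after projection:  %+.4f%n", cosine(pa, pb));
  }

  // Orthogonal projection of v onto the line spanned by d.
  static double[] project(double[] v, double[] d) {
    double scale = dot(v, d) / dot(d, d);
    double[] p = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      p[i] = scale * d[i];
    }
    return p;
  }

  static double cosine(double[] a, double[] b) {
    return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
  }

  static double dot(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      s += a[i] * b[i];
    }
    return s;
  }
}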