It is in the doc, but in short, the gist of the --pca options is as follows. Normally, PCA is a projection onto an (offset) orthogonal basis subspace that retains most of the variance in the original data. It so happens that such a basis corresponds to the right singular vectors of (A - M), hence the projection is formed by (A - M)*V, which is approximated by U*Sigma of a reduced-rank SVD of (A - M) (assuming data points are row vectors). This is the scope of our PCA task here.
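For concreteness, here is a minimal numpy sketch of that algebra (not Mahout code, just the math): the projection (A - M)*V and U*Sigma, taken from the same reduced-rank SVD of the centered matrix, coincide.

import numpy as np

np.random.seed(0)
A = np.random.rand(20, 5)              # 20 data points as row vectors
mu = A.mean(axis=0)                    # column means
Ac = A - mu                            # (A - M), where every row of M is mu

k = 3                                  # reduced rank
U, s, Vt = np.linalg.svd(Ac, full_matrices=False)

pca_proj = Ac @ Vt[:k].T               # projection (A - M) * V
pca_us = U[:, :k] * s[:k]              # U * Sigma of the reduced-rank SVD

print(np.allclose(pca_proj, pca_us))   # True: both come from the same SVD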
Now, consider a typical document corpus of, say, 1 million documents. Typically (depending on stemming) the total dictionary size would be anywhere between 10^5 and 10^6, but let's assume 10^5 for our purposes. An individual document, however, will have perhaps 100 terms on average. Thus the PCA input for such a corpus will be a 1M x 100k sparse matrix, but the number of non-zero elements in it will be just 1M x 100 == 100M. This is a fairly small problem for SSVD to solve.

However, we want to run the SVD on (A - M), where each row of M is the vector of column means of A. The column-mean vector is dense, hence M is a dense matrix, and so is (A - M). In our hypothetical case, then, (A - M) takes 1M x 100k non-zero elements, which means it takes 1000 times more space than A, and its SSVD will take, asymptotically speaking, (1000)^(3/2) times more flops than that of A, i.e. almost 5 orders of magnitude of additional computational cost.

However, the SSVD algorithm may be amended so that it never has to compute svd(A - M) explicitly. I will not go into detail, but in essence the bottom line is that it reduces the computational expense to the equivalent of svd(A), saving almost those 5 orders of magnitude in our hypothetical corpus PCA case (a rough sketch of the idea is appended at the very end of this message, after the quoted thread). Hope this makes things clearer.

On Mon, Sep 10, 2012 at 7:25 AM, Pat Ferrel <[email protected]> wrote:
> Another issue with the SSVD job options is that if you want to use SSVD but
> keep the input in the original dimension-space (term-space in many cases),
> you would do the following:
> * create the input matrix A based on the input dimensions (terms)
> * calculate the full transform, which retains the output in "term-space":
>   AHat = U*Sigma*V^t
> * at the end AHat should be in term-space but transformed by DR and
>   Sigma-weighted for PCA, right?
> * then AHat can be substituted for A wherever analysis or examination needs
>   the original dimension definitions (terms).
>
> The problem with the options is that when you set --uHalfSigma OR
> --vHalfSigma, it sets the sVectors to the sqrt, and that gets applied to
> both U and V, since UJob and VJob only check whether the sVectors exist and
> then both apply them. In other words, setting either --uHalfSigma or
> --vHalfSigma applies sqrt(Sigma) to BOTH U and V.
>
> To compute U*Sigma*V^t, the SSVD code would have to be changed, or the
> U*Sigma would have to be calculated outside SSVD (an ugly alternative).
>
> But please correct me where I'm wrong.
>
>
> On Sep 8, 2012, at 10:52 AM, Pat Ferrel <[email protected]> wrote:
>
> I appreciate it and believe that this will help others too. I also agree
> that we should think this one through to see if it is the correct approach.
>
> I need to figure out why the row ids/keys of the input matrix are not
> getting through clustering. Keys/row ids are getting through rowsimilarity
> when applied to U*S, so why not clustering? In other words, with
> rowsimilarity I can map the results back to the original input rows
> (documents in my case).
>
> As to the U*S option, I agree. I modified the code to take a new --uSigma,
> but it is mutually exclusive of --uHalfSigma, and that indicates that
> neither option should be a boolean. They both also imply calculation of U.
> I assume this is what you meant below.
>
> Another thing to consider, and here my ignorance shows through… The U*S
> (equivalent to A*V) transform to V-space must be reversible so that humans
> can see results in terms of the original term-space. Weights based on the
> new basis are not really human-understandable.
> But setting me straight here may be another conversation.
>
> On Sep 7, 2012, at 4:52 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> I can do a patch to propagate the names of named vectors from A to U too,
> if that's a requirement for what you do. But we need to make sure it solves
> your problem. I am still not sure what the IDs are in your definition and
> what is required for k-means.
>
> Thinking of that, it's probably a worthy patch anyway. I'll write something
> up along with API changes for the A*Sigma outputs. I think since there are
> so many output options, they should be redesigned not to be mutually
> exclusive.
>
> On Fri, Sep 7, 2012 at 4:37 PM, Pat Ferrel <[email protected]> wrote:
>> Yes, I would love to use namedvectors. But no matter, doing a key-to-row
>> lookup is easy enough.
>>
>> I'm not getting any id at all in the cluster data, not even a key for a
>> row.
>>
>> I'm beginning to think this is a clustering problem, since rowsimilarity
>> at least gives me row keys to identify the objects associated with an
>> object.
>>
>> On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> Yeah, seq2sparse seems to have a -nv option now to churn out named vectors
>> too. It doesn't seem to be listed in the MIA book, though.
>>
>> On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>> Sequence file keys, on the other hand, are what is populated by the
>>>> seq2sparse output, so they are useful for mapping results back to the
>>>> original documents.
>>>
>>> Although, honestly, I am not so sure about seq2sparse anymore. It has
>>> been some time since I last looked at it.
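P.S. As promised above, a rough sketch of the amendment that avoids ever forming the dense (A - M). Since every row of M is the same mean vector mu, any product against (A - M) splits into a product against the sparse A plus a rank-one correction. The numpy/scipy sketch below shows only the first random-projection step of a stochastic SVD under that assumption; the variable names are mine, not Mahout's, and this is the idea, not the actual SSVD implementation.

import numpy as np
from scipy import sparse

m, n, p = 1000, 200, 10              # p = k + oversampling
A = sparse.random(m, n, density=0.01, format="csr", random_state=1)
mu = np.asarray(A.mean(axis=0)).ravel()   # dense vector of column means

Omega = np.random.randn(n, p)        # random projection matrix

# naive: densify A and subtract the mean from every row, O(m*n) memory
Y_naive = (A.toarray() - mu) @ Omega

# amended: never form A - M; apply a rank-one correction instead,
# so A stays sparse throughout
Y_amended = A @ Omega - np.outer(np.ones(m), mu @ Omega)

print(np.allclose(Y_naive, Y_amended))    # True

The same rank-one bookkeeping can be carried through the remaining SSVD passes, so the sparse A is all that is ever touched, and the cost stays at the level of svd(A) as claimed above.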
