I don't understand this. What is AHat and how is it different from the input with the mean subtracted?
SSVD PCA doesn't form A-M because that thing is dense, and that's the whole point of the PCA option, i.e. to avoid an ugly, big intermediate dense product and an SVD operation on it. The space required for A-M will be several orders of magnitude bigger than the already sufficiently large input, and running SVD on it will take forever because SSVD flops are f^3. Bottom line: SSVD PCA cannot produce A-M simply because it never computes it. If you need A-M, I guess you could form it yourself, but in any sound architecture it'd be much better to form rows of A-M on the fly instead of computing and storing them. Still, I haven't yet figured out what you are trying to accomplish.

On Sep 10, 2012 7:26 AM, "Pat Ferrel" wrote:
>
> Another issue with the SSVD job options is that if you want to use SSVD but keep the input in the original dimension-space (term-space in many cases) you would do the following:
> * create input matrix A based on input dimensions (terms)
> * calculate the full transform, which retains the output in "term-space": AHat = U*Sigma*V^t
> * at the end AHat should be in term-space but transformed by DR and Sigma-weighted for PCA, right?
> * then AHat can be substituted for A where analysis or examination needs the original dimension definitions (terms).
>
> The problem with the options is that when you set --uHalfSigma OR --vHalfSigma it sets the sVectors to sqrt and that will cause it to be applied to U and V, since UJob and VJob only check to see if the sVectors exist and then they both apply them. In other words, setting either --uHalfSigma OR --vHalfSigma will apply sqrt Sigma to BOTH U and V.

I don't think this statement is correct. I'll check for it, but I am pretty sure this is not how it works. Left and right singular vectors are scaled to similar space individually, on an explicit option to do so.

>
> To do U*Sigma*V^t the SSVD code would have to be changed or the U * Sigma would have to be calculated outside SSVD (an ugly alternative).
>
> But please correct me where I'm wrong.
>
>
> On Sep 8, 2012, at 10:52 AM, Pat Ferrel wrote:
>
> I appreciate it and believe that this will help others too. I also agree that we should think this one through to see if it is the correct approach.
>
> I need to figure out why the row ids/keys of the input matrix are not getting through clustering. Key/row ids are getting through rowsimilarity when applied to U*S, so why not clustering? In other words, with rowsimilarity I can map the results back to the original input rows (documents in my case).
>
> As to the U*S option, I agree. I modified the code to take a new --uSigma but it is mutually exclusive of --uHalfSigma, and that indicates that neither option should be a boolean. They both also imply calculation of U. I assume this is what you meant below.
>
> Another thing to consider, and here my ignorance shows through… The U*S (equivalent to A*V) transform to V-space must be reversible so that humans can see results in terms of the original term-space. Weights based on the new basis are not really human-understandable. But setting me straight here may be another conversation.
>
> On Sep 7, 2012, at 4:52 PM, Dmitriy Lyubimov wrote:
>
> I can do a patch to propagate names of named vectors from A to U too
> if that's a requirement for what you do. But we need to make sure it
> solves your problem. I am still not sure what IDs are in your
> definition and what is required for k-means.
>
> Thinking of that, it's probably a worthy patch anyway. I'll write
> something up along with API changes for A*Sigma outputs. I think since
> there are so many output options, they should be redesigned not to be
> mutually exclusive.
>
> On Fri, Sep 7, 2012 at 4:37 PM, Pat Ferrel wrote:
> > Yes, I would love to use namedvectors. But no matter, doing a key to row lookup is easy enough.
> >
> > I'm not getting any id at all in the cluster data, not even a key for a row.
> >
> > I'm beginning to think this is a clustering problem since rowsimilarity at least gives me row keys to identify objects associated with an object.
> >
> > On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov wrote:
> >
> > Yeah, seq2sparse seems to have the -nv option now to churn out named
> > vectors too. It doesn't seem to be listed in the MIA book though.
> >
> > On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov wrote:
> >> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov wrote:
> >>> Sequence file keys, on the other hand, are
> >>> what is populated by seq2sparse output, so they are useful for mapping
> >>> results to original documents.
> >>
> >> Although honestly I am not so sure about seq2sparse anymore. There has
> >> been some time since I looked at this for the last time.
> >
> >
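
For reference, the suggestion at the top of this thread (form rows of A-M on the fly rather than computing and storing A-M) can be sketched roughly as below. This is only an illustration in plain Python, not Mahout's SSVD code: the dict-based sparse rows and the names `column_means`, `centered_rows`, `centered_times` and `omega` are all made up for the example. The last function additionally shows one way a product with A-M can be had without densifying anything, using (A-M)*omega = A*omega - 1*(mean^T*omega); whether SSVD does it exactly this way internally is not claimed here.

```python
# Sketch only: plain-Python stand-ins, NOT Mahout's SSVD implementation.
# A sparse row is a dict {column_index: value}; `mean` is the dense
# vector of column means (every row of M equals `mean`).

def column_means(rows, n_cols):
    """Column means of a sparse matrix given as a list of {col: value} dicts."""
    sums = [0.0] * n_cols
    for row in rows:
        for j, v in row.items():
            sums[j] += v
    return [s / len(rows) for s in sums]

def centered_rows(rows, mean):
    """Yield dense rows of A - M one at a time; A - M is never stored whole."""
    for row in rows:
        yield [row.get(j, 0.0) - m for j, m in enumerate(mean)]

def centered_times(rows, mean, omega):
    """(A - M) * omega computed as A*omega - 1*(mean^T * omega).

    Only the nonzeros of A are touched; no dense row of A - M is formed.
    """
    k = len(omega[0])
    mean_omega = [sum(mean[j] * omega[j][c] for j in range(len(mean)))
                  for c in range(k)]
    out = []
    for row in rows:
        a_omega = [sum(v * omega[j][c] for j, v in row.items())
                   for c in range(k)]
        out.append([a - m for a, m in zip(a_omega, mean_omega)])
    return out

# Tiny hypothetical input: 3 rows, 3 columns, mostly zeros.
A = [{0: 2.0}, {1: 4.0}, {0: 4.0, 2: 6.0}]
mean = column_means(A, 3)
omega = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # any 3 x k matrix works here

for dense_row in centered_rows(A, mean):
    print(dense_row)
print(centered_times(A, mean, omega))
```

The generator keeps memory at one dense row at a time, while the second form never leaves sparse territory; which one is appropriate depends on what consumes the rows.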
