On Sun, Jan 10, 2010 at 8:39 PM, David Hall <[email protected]> wrote:

> In some sense, I've come to believe that assigning a label to a topic
> reifies it more than it really deserves to be. Topics are in a lot of
> ways like eigenvectors/eigenfaces; you don't really assign a name (or
> even a visual word) to the fourth eigenface, even if it looks like it
> might be smiling a little bit...

Yeah, this is something that has been nagging at me for a while, whenever these questions of "human interpretable labels" for clusters/topics/eigenvectors come up. I don't have deep enough familiarity with all of the techniques involved to say how it applies in every case, but I can say this for eigenvectors. If they are textual, you could pull out the "top-k terms"; if they are faces, you could try to pick out the "top-k facial structures". Either way, the problem of mixing is pretty significant: given two orthonormal eigenvectors e1, e2 of a matrix M, with eigenvalues a1, a2, then even when a1 = 2 * a2, the vector v = e1 + (e2 / 2) satisfies the eigenvector criterion with an error of only about 3% (meaning the cosine between v and M*v is about 0.97, compared to 1.0 for exact eigenvectors, and compared to roughly 1/sqrt(num_dimensions) for two randomly chosen unit vectors).

What this means, in practical terms, is that when you do real large scale decompositions (and I'm thinking this is similar with LDA and the like), numerical errors and imperfect convergence lead to finding a great eigen-*space*, but the actual basis vectors you've found in it can be much more of a mix with each other than you might imagine. Think about it: take the top-k terms from one eigenvector and the top-k terms from another, and now consider a mixture of the two vectors with two parts the first eigenvector and one part the next. The top-k terms of this linear combination could be a considerably different set.

Of course, you can turn this criticism on its head: you could take any slightly rotated version of the basis you originally found, and use that freedom to pick a basis specifically *because* it is more interpretable than the others. Finding an efficient way to do that, though, might be more challenging than the original problem of computing the decomposition in the first place.

-jake
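
For anyone who wants to check the mixing argument above numerically, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: it works directly in the coordinates of e1 and e2 (taken to be orthonormal), and the six-word vocabulary, the term weights, and the helper names `cosine` and `top_k` are all made up for the example rather than taken from any real decomposition.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Part 1: how close is v = e1 + e2/2 to being an eigenvector?
# Coordinates are coefficients on (e1, e2), assumed orthonormal,
# so applying M just scales each coefficient by its eigenvalue.
a1, a2 = 2.0, 1.0                 # a1 = 2 * a2
v = (1.0, 0.5)                    # v = e1 + e2/2
Mv = (a1 * v[0], a2 * v[1])       # M*v = a1*e1 + (a2/2)*e2
mix_cosine = cosine(v, Mv)        # 2.25 / sqrt(1.25 * 4.25)

# Part 2: top-k terms of a mixture versus the pure eigenvectors.
# Hypothetical vocabulary and weights; e1 and e2 are orthogonal
# and have equal norm, so 2*e1 + e2 is "two parts e1, one part e2".
vocab = ["goal", "team", "score", "vote", "party", "poll"]
e1 = [3, 2, 1, 0, 0, 0]
e2 = [0, 0, 0, 3, 2, 1]
mix = [2 * x + y for x, y in zip(e1, e2)]

def top_k(weights, k):
    """Return the k terms with the largest absolute weight."""
    ranked = sorted(zip(vocab, weights), key=lambda tw: -abs(tw[1]))
    return [term for term, _ in ranked[:k]]

print(round(mix_cosine, 3))       # 0.976
print(top_k(e1, 3))               # ['goal', 'team', 'score']
print(top_k(e2, 3))               # ['vote', 'party', 'poll']
print(top_k(mix, 3))              # ['goal', 'team', 'vote']
```

In this toy setup the mixture is within a few percent of passing the eigenvector test, yet its top-3 terms match neither of the two pure vectors.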
