Hi Ted,

An observation from the results.

Previously, I just took the top 10 features of the centroid vector as labels, ordered by each feature's weight in the centroid vector. Now, when I look at the top phrases from the LLR method, I see an overlap between the two result sets. The results are not exactly the same, but most of the top features of the centroid vector score high with the LLR technique as well. In a few cases, the LLR technique comes up with phrases that are very different.

A simple interpretation of the overlap: when a phrase is quite common in a small cluster but uncommon outside the cluster, it is going to have a higher (TF-IDF) weight in the document vectors, and so in the centroid. Such a phrase is identified as prominent by LLR as well. Is there any other reason for this to occur?

Thanks,
--shashi
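
To make the interpretation above concrete, here is a minimal sketch of an LLR score for one phrase, computed from a 2x2 table of document counts (in cluster vs. outside, with the phrase vs. without). It uses the entropy form of the G^2 statistic. The function names and the counts are made up for illustration and are not taken from the actual run or from any library.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy: N*log(N) - sum(k*log(k)), with N = sum(counts).
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: docs in the cluster that contain the phrase
    k12: docs in the cluster that do not contain it
    k21: docs outside the cluster that contain it
    k22: docs outside the cluster that do not contain it
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


# Hypothetical counts: a cluster of 50 docs in a corpus of 1000.
# Phrase A: common in the cluster (40 of 50), rare outside (20 of 950).
score_common_in_cluster = llr(40, 10, 20, 930)
# Phrase B: same rate inside and outside the cluster (10% in both).
score_same_rate = llr(5, 45, 95, 855)

print(score_common_in_cluster, score_same_rate)
```

With these made-up counts, the phrase concentrated in the small cluster gets a large score, while the phrase that occurs at the same rate inside and outside the cluster scores close to zero.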
