Hi Ted,

An observation from the results.

Previously, I just took the top 10 features of the centroid vector as
labels.  The order was decided by the weight of the feature in the
centroid vector.
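
As a rough sketch of the centroid-based labelling described above: average the document vectors of a cluster and keep the highest-weighted features. The function name `top_centroid_features`, the dense list-of-lists input, and the sample weights are all assumptions made for illustration, not the code that produced the results.

```python
def top_centroid_features(doc_vectors, feature_names, k=10):
    """Return the k (feature, weight) pairs with the largest centroid weight.

    doc_vectors: list of equal-length weight lists, one per document
                 in the cluster (hypothetical dense TF-IDF vectors).
    feature_names: feature name for each vector position.
    """
    n = len(doc_vectors)
    # The centroid is the per-feature mean over the cluster's documents.
    centroid = [sum(column) / n for column in zip(*doc_vectors)]
    # Order features by centroid weight, largest first.
    ranked = sorted(zip(feature_names, centroid),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up weights for a two-document cluster.
labels = top_centroid_features(
    [[0.9, 0.1, 0.0], [0.7, 0.0, 0.2]],
    ["apache mahout", "the", "cluster label"],
    k=2,
)
```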

Now, when I look at the top phrases from the LLR method, I see an
overlap between the two result sets. The results are not exactly the
same, but most of the top features of the centroid vector also score
high with the LLR technique.  In a few cases, the LLR technique comes up
with phrases that are very different.

A simple interpretation is that when a phrase is quite common in a
small cluster but uncommon outside the cluster, it is going to have a
higher (TF-IDF) weight in the document vectors, and hence in the centroid.
Such a phrase is identified as prominent by LLR as well.
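
To make that interpretation concrete, here is a minimal sketch, assuming "LLR" means the log-likelihood ratio (G^2) statistic on a 2x2 table of document counts: in-cluster vs. out-of-cluster, with vs. without the phrase. The function names and the counts are hypothetical. A phrase that is common inside a small cluster and rare outside it gets a large score, while a phrase that is about equally frequent on both sides scores close to zero.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is taken to be 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: documents in the cluster that contain the phrase
    k12: documents in the cluster that do not
    k21: documents outside the cluster that contain the phrase
    k22: documents outside the cluster that do not
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


# Hypothetical counts: a 50-document cluster in a 1000-document corpus.
distinctive = llr(40, 10, 5, 945)    # frequent in cluster, rare outside
background = llr(10, 40, 200, 750)   # about equally frequent everywhere
```

Because TF-IDF also rewards phrases that are frequent locally and rare globally, the two rankings would be expected to agree on phrases like the first one.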

Is there any other reason for this to occur?

Thanks,

--shashi
