Hi,
As I've mentioned in the past, I'm working on clustering documents
(albeit relatively small ones). The clustering mechanism I've ended up
with has produced some pretty good results (at least for what I need).
What I'd like to do now is find a way to automate the naming of these
groups.
For example, if each document has a six- or seven-word title, I'd like
to produce names that are logically ordered (that is, they make
grammatical sense; this can probably be inferred from frequencies
within the cluster, since most documents in a cluster should be
well-formed) and that share terms with the majority of the titles.
So far, I'm using a kind of hacked-together longest common substring
method:
* Sort the titles within the cluster
* Compare every string against every other string, producing an LCS value
* Use the most common LCS as the cluster name
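
As a rough illustration, the steps above could look something like the
following Python sketch. It is not the exact code in use: it assumes
titles are lowercased and split on whitespace, and that the LCS is taken
over whole words rather than characters. The names
longest_common_word_run and name_cluster are made up for the example.

```python
from collections import Counter
from itertools import combinations


def longest_common_word_run(a, b):
    """Return the longest contiguous run of words shared by lists a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common run ending at a[i-2], b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return tuple(a[best_end - best_len:best_end])


def name_cluster(titles):
    """Name a cluster after its most frequent pairwise common word run."""
    # Assumption: lowercase + whitespace split is a good enough tokeniser.
    tokenised = sorted(t.lower().split() for t in titles)
    counts = Counter()
    # Compare every title against every other title.
    for a, b in combinations(tokenised, 2):
        run = longest_common_word_run(a, b)
        if run:
            counts[run] += 1
    if not counts:
        return ""
    # Most frequent run wins; ties go to the longer run.
    best = max(counts, key=lambda run: (counts[run], len(run)))
    return " ".join(best)


titles = [
    "Introduction to machine learning basics",
    "A guide to machine learning methods",
    "Machine learning for text data",
]
print(name_cluster(titles))  # prints "machine learning"
```

One thing the sketch makes visible: picking the most frequent run tends
to favour short runs, since a single shared word can turn up in more
pairs than a longer phrase does.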
As this is all relatively new ground for me, I was wondering whether
there are any better methods I could be using.
Thanks,
Paul