Hi Jerome,

Yes Dawid, but it is already committed => the clustering now uses the plain
text version returned by the toString() method.

Ugh, yes, sorry about that. To be specific, it uses Summary.toStrings(summaries), which calls toString() internally.

Actually, the clustering uses the summaries as input. I assume it would
provide better results if it took the whole document content, no?
I assume that clustering uses the summaries instead of the document content
for performance reasons.

Not always. Or rather: it depends on what your goals are. Full-document clustering will take longer (word segmentation, feature extraction, etc.), but since you have more data to work with, document similarity should be more accurate and hence the clusters more sensible. In practice, however, similarity between documents and "cluster quality" are just mathematical concepts that are never shown to the user -- what the user sees is the representation of a cluster, which in the case of full-document clustering is usually quite inconvenient to build and has only a weak relationship with the actual mathematical model of the clusters.

Contextual (keyword-in-context) snippets have a great advantage: they are shorter and carry the neighborhood of your query's terms. This very neighborhood (or rather: repetitive sequences of terms) can be used to first determine "clusters" of documents and then to describe them to the user. This is how most Web clustering algorithms work (excuse me if I explained it in a very imprecise way).
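
To make that a bit more concrete, here is a rough Python sketch of the idea. It is only an illustration, not the algorithm the clustering plugin actually uses: the function names (kwic_snippet, phrase_clusters), the window size, the bigram length and the sample documents are all made up for the example. It cuts a keyword-in-context snippet around the query term, then groups documents by the word sequences their snippets share, and reuses the shared sequence as the cluster label.

```python
from collections import defaultdict


def kwic_snippet(text, query, window=3):
    """Return the words within `window` positions of the first query hit."""
    words = text.lower().split()
    for i, word in enumerate(words):
        if word == query.lower():
            return words[max(0, i - window): i + window + 1]
    return []


def phrase_clusters(snippets, n=2, min_docs=2):
    """Group document ids by the word n-grams their snippets share.

    The shared phrase doubles as the cluster label shown to the user.
    """
    docs_by_phrase = defaultdict(set)
    for doc_id, words in snippets.items():
        for i in range(len(words) - n + 1):
            docs_by_phrase[" ".join(words[i:i + n])].add(doc_id)
    return {phrase: sorted(ids)
            for phrase, ids in docs_by_phrase.items()
            if len(ids) >= min_docs}


# Hypothetical sample documents for the query "jaguar".
docs = {
    "d1": "the jaguar car is a luxury vehicle made in england",
    "d2": "a new jaguar car was shown at the motor show",
    "d3": "the jaguar cat hunts at night in the rain forest",
    "d4": "a wild jaguar cat was seen near the river",
}
snippets = {doc_id: kwic_snippet(text, "jaguar") for doc_id, text in docs.items()}
clusters = phrase_clusters(snippets)
```

Here "jaguar car" groups d1 and d2, and "jaguar cat" groups d3 and d4. The sketch also produces a useless "the jaguar" group (d1 and d3), which is the kind of noise a real algorithm would have to filter out, for instance by ignoring stop words.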

But there is a (bad) side effect: since the size of the summaries is
configurable, the clustering "quality" will vary depending on the configured
summary size. I find this very confusing: when folks adjust this parameter,
it is only for front-end considerations (they want to display a long or a
short summary), certainly not for clustering reasons.

You're right -- changing anything about the input (snippet length, number of documents, etc.) will alter the clusters. This is basically how it works. If you want clustering in your search engine then, depending on the type of data you serve, you'll have to experiment with the settings a bit and see which give you satisfactory results. I don't think there is any particular reason to provide different data to the clusterer. Moreover, it'd complicate things quite badly.
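
A tiny Python illustration of that sensitivity, under made-up assumptions: the helper names (snippet, jaccard), the two sample documents and the use of Jaccard word overlap as the similarity measure are all mine, not something the clusterer is known to use. The only point is that the same two documents look more or less alike depending on how many words around the query term the snippet keeps.

```python
def snippet(text, query, window):
    """Set of words within `window` positions of the query term."""
    words = text.lower().split()
    i = words.index(query)
    return set(words[max(0, i - window): i + window + 1])


def jaccard(a, b):
    """Word-overlap similarity: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


# Hypothetical documents, both matching the query "paris".
doc_a = "cheap flights to paris from london every morning"
doc_b = "hotels in paris near london road are cheap"

# Similarity of the same pair of documents for a short and a long snippet.
sims = {
    window: jaccard(snippet(doc_a, "paris", window),
                    snippet(doc_b, "paris", window))
    for window in (1, 5)
}
```

With a window of 1 the snippets share only "paris" (similarity 0.2); with a window of 5 they also share "london" and "cheap" (3/13, about 0.23). Any clusterer working from these numbers would see different input in the two cases.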

D.



