Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

Dawid Weiss Fri, 12 May 2006 05:33:30 -0700


Hi Jerome,

Yes Dawid, but it is already committed => the clustering now uses the plain
text version returned by the toString() method.

Ugh, yes, sorry about that. To be specific, it uses Summary.toStrings(summaries), which uses toString() internally.

Actually, the clustering uses the summaries as input. I assume it would
provide better results if it took the whole document content, no?
I assume the clustering uses the summaries instead of the document content
for performance reasons.

Not always. Or rather: it depends on what your goals are. Full document clustering will take longer (word segmentation, feature extraction etc.), but since you have more data to work with, document similarity should be more accurate and hence clusters more sensible. In practice, however, similarity between documents and "cluster quality" is just a mathematical concept which is never shown to the user -- what the user sees is the representation of a cluster, which in the case of full-document clustering is usually quite inconvenient to build and has a weak relationship with the actual mathematical model of clusters.

Contextual (keyword-in-context) snippets have a great advantage: they are shorter and carry the neighborhood of your query's terms. This very neighborhood (or rather: repetitive sequences of terms) can be used to first determine "clusters" of documents and then to describe them to the user. This is how most Web clustering algorithms work (excuse me if I explained it in a very imprecise way).
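
To make the "repetitive sequences of terms" idea concrete, here is a minimal sketch, not the actual clusterer used in Nutch. It assumes that a "sequence" is simply a pair of adjacent words and that a cluster is any pair shared by at least minDocs snippets. The class name SnippetPhraseClusters, the cluster() method and the sample snippets are all hypothetical. The shared phrase does double duty: it selects the documents and it serves as the label shown to the user.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: groups snippets by word pairs that recur across snippets. */
public class SnippetPhraseClusters {

    /** Maps each shared word pair (the cluster label) to the indices of the snippets containing it. */
    public static Map<String, TreeSet<Integer>> cluster(List<String> snippets, int minDocs) {
        Map<String, TreeSet<Integer>> phraseToDocs = new LinkedHashMap<>();
        for (int doc = 0; doc < snippets.size(); doc++) {
            // Naive segmentation (an assumption): lowercase, split on non-alphanumerics.
            List<String> words = new ArrayList<>();
            for (String w : snippets.get(doc).toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
            // Record every pair of adjacent words together with the snippet it came from.
            for (int i = 0; i + 1 < words.size(); i++) {
                String phrase = words.get(i) + " " + words.get(i + 1);
                phraseToDocs.computeIfAbsent(phrase, k -> new TreeSet<>()).add(doc);
            }
        }
        // Keep only phrases that occur in at least minDocs different snippets.
        phraseToDocs.values().removeIf(docs -> docs.size() < minDocs);
        return phraseToDocs;
    }

    public static void main(String[] args) {
        List<String> snippets = Arrays.asList(
                "Apache Nutch is an open source web crawler",
                "The web crawler fetches pages",
                "Lucene is a search library");
        // The only word pair shared by two snippets is "web crawler" (snippets 0 and 1).
        System.out.println(cluster(snippets, 2));
    }
}
```

Real Web clustering algorithms use longer phrases, stop-word handling and cluster merging; the sketch only shows why short snippets centered on the query terms are enough to both find and describe a group.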

But there is a (bad) side effect: since the size of the summaries is
configurable, the clustering "quality" will vary depending on the summary
size configuration. I really find this very confusing: when folks adjust
this parameter it is only for front-end considerations (they want to display
a long or a short summary), but certainly not for clustering reasons.

You're right -- changing anything in the input (snippet length, number of documents etc.) will alter the clusters. This is basically how it works. If you want clustering in your search engine then, depending on the type of data you serve, you'll have to experiment with the settings a bit and see which give you satisfactory results. I don't think there is any particular reason to provide different data to the clusterer. Moreover, it'd complicate things quite badly.
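
A small illustration of that sensitivity, under simplifying assumptions: snippets are cut to a fixed number of words and "similarity" is just the count of shared words. The class SnippetLengthEffect, its methods and the two sample snippets are made up for this sketch and are not part of Nutch.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: how snippet length changes the term overlap a clusterer sees. */
public class SnippetLengthEffect {

    /** Returns the first maxWords whitespace-separated words of text, lowercased, as a set. */
    static Set<String> terms(String text, int maxWords) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        int n = Math.min(maxWords, words.length);
        return new LinkedHashSet<>(Arrays.asList(words).subList(0, n));
    }

    /** Counts the words two snippets have in common after truncating both to maxWords. */
    public static int sharedTerms(String a, String b, int maxWords) {
        Set<String> common = terms(a, maxWords);
        common.retainAll(terms(b, maxWords));
        return common.size();
    }

    public static void main(String[] args) {
        String a = "nutch fetches pages with its web crawler";
        String b = "lucene indexes text for a web crawler";
        // With 3 words the snippets share nothing; with 7 they share "web" and "crawler".
        for (int len : new int[] {3, 7}) {
            System.out.println(len + " words: " + sharedTerms(a, b, len) + " shared terms");
        }
    }
}
```

The same two documents look unrelated or related depending only on the snippet length, which is why the setting has to be tuned by experiment.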

D.
