Marvin Humphrey wrote:

BTW, clustering in Information Retrieval usually implies grouping by vector distance using statistical methods:

http://en.wikipedia.org/wiki/Data_clustering

In general, all you need is objects with
a pairwise similarity (dissimilarity) measure.
With (term) vectors, that's usually one of
the multitude of TF/IDF cosine measures, whereas
in other machine learning apps it's typically
Euclidean distance (often z-score normalized to
scale the dimensions).
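
For concreteness, here's roughly what a TF/IDF-style cosine
comparison looks like over sparse term-weight vectors; the class
and method names are made up for illustration, not anything in
Lucene:

    import java.util.Map;

    // Illustrative helper (not Lucene API): cosine similarity between two
    // sparse term-weight vectors represented as term -> weight maps.
    public final class CosineSimilarity {
        public static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double w = b.get(e.getKey());
                if (w != null) {
                    dot += e.getValue().doubleValue() * w.doubleValue();
                }
                normA += e.getValue().doubleValue() * e.getValue().doubleValue();
            }
            for (Double w : b.values()) {
                normB += w.doubleValue() * w.doubleValue();
            }
            if (normA == 0.0 || normB == 0.0) {
                return 0.0;
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }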

For the more sophisticated clustering algorithms,
like EM (soft/model-based) clustering, you can
use similarities between clusters (instead of
deriving these from similarities between items).
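
EM itself takes more machinery than fits in an email, but as a
rough illustration of a cluster-level similarity, here's a
centroid-linkage sketch (centroid linkage, not EM; names again
invented). Cluster-to-cluster similarity is then, e.g., the cosine
of the two centroids rather than something derived from item-pair
similarities:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch: average a cluster's member vectors into a
    // single centroid vector.  Compare two clusters by applying a cosine
    // measure (like the sketch above) to their centroids.
    public final class CentroidLinkage {
        public static Map<String, Double> centroid(List<Map<String, Double>> members) {
            Map<String, Double> sum = new HashMap<String, Double>();
            for (Map<String, Double> vec : members) {
                for (Map.Entry<String, Double> e : vec.entrySet()) {
                    Double prev = sum.get(e.getKey());
                    double base = (prev == null) ? 0.0 : prev.doubleValue();
                    sum.put(e.getKey(), Double.valueOf(base + e.getValue().doubleValue()));
                }
            }
            for (Map.Entry<String, Double> e : sum.entrySet()) {
                e.setValue(Double.valueOf(e.getValue().doubleValue() / members.size()));
            }
            return sum;
        }
    }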

Exactly. I'd scanned this, but I haven't yet familiarized myself with the different models.

It may be possible for both keyword fields, e.g. "host", and non-keyword fields, e.g. "content", to be clustered using the same algorithm and an interface like Hits.cluster(String fieldname, int docsPerCluster). Retrieve each hit's vector for the specified field, map the docs into a unified term space, then cluster. For "host" or any other keyword field, the boundaries will be stark and the cost of calculation negligible. For "content", a more sophisticated model will be required to group the docs, and the cost will be greater.
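
As a rough sketch of the vector-gathering half of that idea,
written as a free-standing helper rather than as the proposed
Hits.cluster() itself (the helper name is invented, and the
term-vector calls should be checked against your Lucene version;
the clustering step is left out):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;
    import org.apache.lucene.search.Hits;

    // Pull each hit's term vector for one field and map it into a shared
    // term -> frequency representation a clustering algorithm could use.
    public final class HitVectors {
        public static List<Map<String, Integer>> termSpace(IndexReader reader,
                                                           Hits hits,
                                                           String field)
                throws IOException {
            List<Map<String, Integer>> vectors =
                new ArrayList<Map<String, Integer>>();
            for (int i = 0; i < hits.length(); i++) {
                Map<String, Integer> vec = new HashMap<String, Integer>();
                TermFreqVector tfv = reader.getTermFreqVector(hits.id(i), field);
                if (tfv != null) {  // null if the field has no stored term vector
                    String[] terms = tfv.getTerms();
                    int[] freqs = tfv.getTermFrequencies();
                    for (int j = 0; j < terms.length; j++) {
                        vec.put(terms[j], Integer.valueOf(freqs[j]));
                    }
                }
                vectors.add(vec);
            }
            return vectors;
        }
    }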

This is an issue of scaling the different dimensions.
You can "boost" the dimensions any way you want just
like other vector-based search operations.
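
E.g., something as simple as this (made-up helper, not Lucene API):

    import java.util.Map;

    // Scale ("boost") every dimension of a term-weight vector, e.g. to make
    // a keyword field like "host" count for more or less than "content".
    public final class VectorBoost {
        public static void boost(Map<String, Double> vec, double factor) {
            for (Map.Entry<String, Double> e : vec.entrySet()) {
                e.setValue(Double.valueOf(e.getValue().doubleValue() * factor));
            }
        }
    }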

It is more expensive to calculate similarity based on the entire document's contents rather than just a snippet chosen by the Highlighter. However, it's presumably more accurate, and having the Term Vectors pre-built at index time should help quite a bit.
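
(For reference, storing term vectors at index time looks roughly
like this; the Field constructor differs across Lucene versions,
so treat it as a sketch:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Store term vectors at index time so they don't have to be recomputed
    // when clustering.  Constructor details vary by Lucene version.
    public final class IndexWithVectors {
        public static void addDoc(IndexWriter writer, String host, String content)
                throws Exception {
            Document doc = new Document();
            doc.add(new Field("host", host,
                    Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
            doc.add(new Field("content", content,
                    Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
            writer.addDocument(doc);
        }
    }
)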

This varies, actually, depending on the document.  If
you grab HTML from a portal, and use it all, pages from
that portal will tend to cluster together.  If you just
use snippets of text around document passages that
match your query, you can actually get more accurate
clustering relative to your query.  It really depends on
whether the documents are single-topic and coherent.  If so,
use them all; if not, use snippets.  [You can see this
problem leading the Google News classifier astray on
occasion.]
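
A rough sketch of pulling query-relative snippets with the
contrib Highlighter (method names from memory, so check them
against your version); the returned text would then be tokenized
into the vector you cluster on:

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    // Build the text to cluster on from query-relative snippets instead of
    // the whole document.
    public final class SnippetText {
        public static String snippetsFor(Query query, Analyzer analyzer,
                                         String field, String text)
                throws IOException {
            Highlighter highlighter = new Highlighter(new QueryScorer(query));
            TokenStream tokens =
                analyzer.tokenStream(field, new StringReader(text));
            // Up to 3 best-matching fragments, joined with "...".
            return highlighter.getBestFragments(tokens, text, 3, " ... ");
        }
    }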

A typical way to approximate is to take only the high TF/IDF
terms.  Principal component methods (e.g. latent semantic
indexing) are also popular for reducing dimensionality
(usually with a least-squares fit criterion).
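
E.g., pruning a document vector down to its top tf*idf terms,
with document frequencies pulled from the index (helper names
invented; only docFreq() and numDocs() are Lucene calls):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    // Keep only the maxTerms highest tf*idf terms of a document vector
    // before clustering, to cut down dimensionality.
    public final class TopTerms {
        public static Map<String, Double> prune(IndexReader reader, String field,
                                                Map<String, Integer> termFreqs,
                                                int maxTerms) throws IOException {
            final Map<String, Double> weights = new HashMap<String, Double>();
            int numDocs = reader.numDocs();
            for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
                int df = reader.docFreq(new Term(field, e.getKey()));
                double idf = Math.log((double) numDocs / (df + 1));
                weights.put(e.getKey(),
                            Double.valueOf(e.getValue().intValue() * idf));
            }
            List<String> terms = new ArrayList<String>(weights.keySet());
            Collections.sort(terms, new Comparator<String>() {
                public int compare(String a, String b) {
                    return Double.compare(weights.get(b).doubleValue(),
                                          weights.get(a).doubleValue());
                }
            });
            Map<String, Double> pruned = new HashMap<String, Double>();
            int keep = Math.min(maxTerms, terms.size());
            for (String term : terms.subList(0, keep)) {
                pruned.put(term, weights.get(term));
            }
            return pruned;
        }
    }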

A more extreme way to approximate is with signature
files (e.g. to do web-scale "more documents like this"),
but Lucene's not going to help you there.  Check out
"Managing Gigabytes" for more on this approach.

- Bob Carpenter
  Alias-i
