Marvin Humphrey wrote:

BTW, clustering in Information Retrieval usually implies grouping by vector distance using statistical methods:

http://en.wikipedia.org/wiki/Data_clustering

In general, all you need is objects with
a pairwise similarity (dissimilarity) measure.
With (term) vectors, that's usually one of
the multitude of TF/IDF cosine measures, whereas
in other machine learning apps it's typically
Euclidean distance (often z-score normalized to
scale the dimensions).
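
For concreteness, here's roughly what a TF/IDF-style cosine
comparison looks like over sparse term-weight vectors; the class
and method names are made up for illustration, not anything in
Lucene:

    import java.util.Map;

    // Illustrative helper (not Lucene API): cosine similarity between two
    // sparse term-weight vectors represented as term -> weight maps.
    public final class CosineSimilarity {
        public static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double w = b.get(e.getKey());
                if (w != null) {
                    dot += e.getValue().doubleValue() * w.doubleValue();
                }
                normA += e.getValue().doubleValue() * e.getValue().doubleValue();
            }
            for (Double w : b.values()) {
                normB += w.doubleValue() * w.doubleValue();
            }
            if (normA == 0.0 || normB == 0.0) {
                return 0.0;
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }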

For the more sophisticated clustering algorithms,
like EM (soft/model-based) clustering, you can
use similarities between clusters (instead of
deriving these from similarities between items).
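
EM itself takes more machinery than fits in an email, but as a
rough illustration of a cluster-level similarity, here's a
centroid-linkage sketch (centroid linkage, not EM; names again
invented). Cluster-to-cluster similarity is then, e.g., the cosine
of the two centroids rather than something derived from item-pair
similarities:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch: average a cluster's member vectors into a
    // single centroid vector.  Compare two clusters by applying a cosine
    // measure (like the sketch above) to their centroids.
    public final class CentroidLinkage {
        public static Map<String, Double> centroid(List<Map<String, Double>> members) {
            Map<String, Double> sum = new HashMap<String, Double>();
            for (Map<String, Double> vec : members) {
                for (Map.Entry<String, Double> e : vec.entrySet()) {
                    Double prev = sum.get(e.getKey());
                    double base = (prev == null) ? 0.0 : prev.doubleValue();
                    sum.put(e.getKey(), Double.valueOf(base + e.getValue().doubleValue()));
                }
            }
            for (Map.Entry<String, Double> e : sum.entrySet()) {
                e.setValue(Double.valueOf(e.getValue().doubleValue() / members.size()));
            }
            return sum;
        }
    }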

Exactly. I'd scanned this, but I haven't yet familiarized myself with the different models.

It may be possible for both keyword fields, e.g. "host", and non-keyword fields, e.g. "content", to be clustered using the same algorithm and an interface like Hits.cluster(String fieldname, int docsPerCluster). Retrieve each hit's vector for the specified field, map the docs into a unified term space, then cluster. For "host" or any other keyword field, the boundaries will be stark and the cost of calculation negligible. For "content", a more sophisticated model will be required to group the docs, and the cost will be greater.
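
As a rough sketch of the vector-gathering half of that idea,
written as a free-standing helper rather than as the proposed
Hits.cluster() itself (the helper name is invented, and the
term-vector calls should be checked against your Lucene version;
the clustering step is left out):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;
    import org.apache.lucene.search.Hits;

    // Pull each hit's term vector for one field and map it into a shared
    // term -> frequency representation a clustering algorithm could use.
    public final class HitVectors {
        public static List<Map<String, Integer>> termSpace(IndexReader reader,
                                                           Hits hits,
                                                           String field)
                throws IOException {
            List<Map<String, Integer>> vectors =
                new ArrayList<Map<String, Integer>>();
            for (int i = 0; i < hits.length(); i++) {
                Map<String, Integer> vec = new HashMap<String, Integer>();
                TermFreqVector tfv = reader.getTermFreqVector(hits.id(i), field);
                if (tfv != null) {  // null if the field has no stored term vector
                    String[] terms = tfv.getTerms();
                    int[] freqs = tfv.getTermFrequencies();
                    for (int j = 0; j < terms.length; j++) {
                        vec.put(terms[j], Integer.valueOf(freqs[j]));
                    }
                }
                vectors.add(vec);
            }
            return vectors;
        }
    }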

This is an issue of scaling the different dimensions.
You can "boost" the dimensions any way you want just
like other vector-based search operations.
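
E.g., something as simple as this (made-up helper, not Lucene API):

    import java.util.Map;

    // Scale ("boost") every dimension of a term-weight vector, e.g. to make
    // a keyword field like "host" count for more or less than "content".
    public final class VectorBoost {
        public static void boost(Map<String, Double> vec, double factor) {
            for (Map.Entry<String, Double> e : vec.entrySet()) {
                e.setValue(Double.valueOf(e.getValue().doubleValue() * factor));
            }
        }
    }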

It is more expensive to calculate similarity based on the entire document's contents rather than just a snippet chosen by the Highlighter. However, it's presumably more accurate, and having the Term Vectors pre-built at index time should help quite a bit.
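
(For reference, storing term vectors at index time looks roughly
like this; the Field constructor differs across Lucene versions,
so treat it as a sketch:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Store term vectors at index time so they don't have to be recomputed
    // when clustering.  Constructor details vary by Lucene version.
    public final class IndexWithVectors {
        public static void addDoc(IndexWriter writer, String host, String content)
                throws Exception {
            Document doc = new Document();
            doc.add(new Field("host", host,
                    Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
            doc.add(new Field("content", content,
                    Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
            writer.addDocument(doc);
        }
    }
)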

This varies, actually, depending on the document.  If
you grab HTML from a portal, and use it all, pages from
that portal will tend to cluster together.  If you just
use snippets of text around document passages that
match your query, you can actually get more accurate
clustering relative to your query.  It really depends on
whether the documents are single-topic and coherent.  If so,
use them all; if not, use snippets.  [You can see this
problem leading the Google News classifier astray on
occasion.]
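
A rough sketch of pulling query-relative snippets with the
contrib Highlighter (method names from memory, so check them
against your version); the returned text would then be tokenized
into the vector you cluster on:

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    // Build the text to cluster on from query-relative snippets instead of
    // the whole document.
    public final class SnippetText {
        public static String snippetsFor(Query query, Analyzer analyzer,
                                         String field, String text)
                throws IOException {
            Highlighter highlighter = new Highlighter(new QueryScorer(query));
            TokenStream tokens =
                analyzer.tokenStream(field, new StringReader(text));
            // Up to 3 best-matching fragments, joined with "...".
            return highlighter.getBestFragments(tokens, text, 3, " ... ");
        }
    }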

A typical way to approximate is to take only the high TF/IDF
terms.  Principal component methods (e.g. latent semantic
indexing) are also popular for reducing dimensionality
(usually with a least-squares fit criterion).
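
E.g., pruning a document vector down to its top tf*idf terms,
with document frequencies pulled from the index (helper names
invented; only docFreq() and numDocs() are Lucene calls):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    // Keep only the maxTerms highest tf*idf terms of a document vector
    // before clustering, to cut down dimensionality.
    public final class TopTerms {
        public static Map<String, Double> prune(IndexReader reader, String field,
                                                Map<String, Integer> termFreqs,
                                                int maxTerms) throws IOException {
            final Map<String, Double> weights = new HashMap<String, Double>();
            int numDocs = reader.numDocs();
            for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
                int df = reader.docFreq(new Term(field, e.getKey()));
                double idf = Math.log((double) numDocs / (df + 1));
                weights.put(e.getKey(),
                            Double.valueOf(e.getValue().intValue() * idf));
            }
            List<String> terms = new ArrayList<String>(weights.keySet());
            Collections.sort(terms, new Comparator<String>() {
                public int compare(String a, String b) {
                    return Double.compare(weights.get(b).doubleValue(),
                                          weights.get(a).doubleValue());
                }
            });
            Map<String, Double> pruned = new HashMap<String, Double>();
            int keep = Math.min(maxTerms, terms.size());
            for (String term : terms.subList(0, keep)) {
                pruned.put(term, weights.get(term));
            }
            return pruned;
        }
    }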

A more extreme way to approximate is with signature
files (e.g. to do web-scale "more documents like this"),
but Lucene's not going to help you there.  Check out
"Managing Gigabytes" for more on this approach.

- Bob Carpenter
  Alias-i
