Marvin Humphrey wrote:
BTW, clustering in Information Retrieval usually implies grouping by
vector distance using statistical methods:
http://en.wikipedia.org/wiki/Data_clustering
In general, all you need is objects with
a pairwise similarity (dissimilarity) measure.
With (term) vectors, that's usually one of
the multitude of TF/IDF cosine measures, whereas
in other machine learning apps it's typically
Euclidean distance (often z-score normalized to
scale the dimensions).
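To make that concrete, here's a rough, untested sketch of a TF/IDF-weighted
cosine between two term vectors, with vectors as plain term -> count maps and
a precomputed idf table (none of this is Lucene API, just illustration):

import java.util.Map;

/**
 * Sketch only: TF/IDF-weighted cosine similarity between two
 * term-frequency vectors represented as term -> count maps.
 * The idf map is assumed to hold a precomputed inverse document
 * frequency for each term.
 */
public class TfIdfCosine {

    public static double cosine(Map<String, Integer> a,
                                Map<String, Integer> b,
                                Map<String, Double> idf) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double wA = e.getValue() * idf.getOrDefault(e.getKey(), 0.0);
            normA += wA * wA;
            Integer tfB = b.get(e.getKey());
            if (tfB != null) {
                double wB = tfB * idf.getOrDefault(e.getKey(), 0.0);
                dot += wA * wB;
            }
        }
        for (Map.Entry<String, Integer> e : b.entrySet()) {
            double wB = e.getValue() * idf.getOrDefault(e.getKey(), 0.0);
            normB += wB * wB;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}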
For the more sophisticated clustering algorithms,
like EM (soft/model-based) clustering, you can
use similarities between clusters (instead of
deriving these from similarities between items).
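For example, an agglomerative clusterer can score a merge by comparing
cluster-level representations directly, say centroids of the members'
TF/IDF vectors, rather than averaging pairwise item similarities.
Rough sketch, same caveats as above:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: represent a cluster by the centroid of its members'
 *  TF/IDF-weighted term vectors and compare clusters via that
 *  centroid, rather than averaging pairwise item similarities. */
public class ClusterSimilarity {

    /** Average the members' weighted term vectors into a centroid. */
    public static Map<String, Double> centroid(List<Map<String, Double>> members) {
        Map<String, Double> c = new HashMap<>();
        for (Map<String, Double> v : members) {
            for (Map.Entry<String, Double> e : v.entrySet()) {
                c.merge(e.getKey(), e.getValue() / members.size(), Double::sum);
            }
        }
        return c;
    }

    /** Cosine between two centroids (already TF/IDF weighted). */
    public static double similarity(Map<String, Double> c1, Map<String, Double> c2) {
        double dot = 0.0, n1 = 0.0, n2 = 0.0;
        for (Map.Entry<String, Double> e : c1.entrySet()) {
            n1 += e.getValue() * e.getValue();
            Double w2 = c2.get(e.getKey());
            if (w2 != null) dot += e.getValue() * w2;
        }
        for (double w : c2.values()) n2 += w * w;
        return (n1 == 0.0 || n2 == 0.0) ? 0.0 : dot / (Math.sqrt(n1) * Math.sqrt(n2));
    }
}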
Exactly. I'd scanned this, but I haven't yet familiarized myself with
the different models.
It may be possible for both keyword fields (e.g. "host") and non-keyword
fields (e.g. "content") to be clustered using the same algorithm and an
interface like Hits.cluster(String fieldname, int docsPerCluster).
Retrieve each hit's vector for the specified field, and map the docs
into a unified term space, then cluster. For "host" or any other
keyword field, the boundaries will be stark and the cost of calculation
negligible. For "content", a more sophisticated model will be required
to group the docs and the cost will be greater.
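To sketch what I mean (Hits.cluster() doesn't exist, and this is untested),
the guts would pull each hit's term vector for the field and collect them
into maps whose combined key set is the unified term space:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.Hits;

/**
 * Sketch only -- Hits.cluster() is hypothetical.  This shows roughly how
 * such a method might gather the per-hit term vectors for one field and
 * map them into a unified term space before handing them to whatever
 * clustering algorithm is chosen.
 */
public class HitClusteringSketch {

    public static List<Map<String, Integer>> termVectors(Hits hits,
                                                         IndexReader reader,
                                                         String fieldName)
            throws java.io.IOException {
        List<Map<String, Integer>> vectors = new ArrayList<>();
        for (int i = 0; i < hits.length(); i++) {
            TermFreqVector tfv = reader.getTermFreqVector(hits.id(i), fieldName);
            Map<String, Integer> vector = new HashMap<>();
            if (tfv != null) {  // field may have been indexed without term vectors
                String[] terms = tfv.getTerms();
                int[] freqs = tfv.getTermFrequencies();
                for (int j = 0; j < terms.length; j++) {
                    vector.put(terms[j], freqs[j]);
                }
            }
            vectors.add(vector);
            // The union of all keys across these maps is the unified term
            // space; a clusterer would assign each term a dimension index.
        }
        return vectors;
    }
}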
This is an issue of scaling the different dimensions.
You can "boost" the dimensions any way you want, just
as in other vector-based search operations.
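For instance, prefixing terms with their field name keeps the "host" and
"content" dimensions separate in the unified space, and a per-field boost
scales them (illustrative only, not Lucene API):

import java.util.HashMap;
import java.util.Map;

/** Sketch: scale ("boost") the dimensions contributed by each field
 *  before clustering, analogous to field boosts at search time.
 *  Terms are prefixed with their field name so that "host" and
 *  "content" dimensions stay distinct in the unified space. */
public class FieldBoosts {

    public static Map<String, Double> combine(Map<String, Integer> hostVector,
                                              Map<String, Integer> contentVector,
                                              double hostBoost,
                                              double contentBoost) {
        Map<String, Double> combined = new HashMap<>();
        for (Map.Entry<String, Integer> e : hostVector.entrySet()) {
            combined.put("host:" + e.getKey(), e.getValue() * hostBoost);
        }
        for (Map.Entry<String, Integer> e : contentVector.entrySet()) {
            combined.put("content:" + e.getKey(), e.getValue() * contentBoost);
        }
        return combined;
    }
}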
It is more expensive to calculate similarity based on the entire
document's contents rather than just a snippet chosen by the
Highlighter. However, it's presumably more accurate, and having the
Term Vectors pre-built at index time should help quite a bit.
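For reference, storing the Term Vectors at index time is just a matter of
how the field is added (Lucene 2.x-style Field constants shown here;
adjust for your version):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/** Sketch: store term vectors for the "content" field at index time
 *  so that per-document vectors can be pulled back cheaply for
 *  clustering later. */
public class IndexingWithTermVectors {

    public static Document makeDocument(String host, String content) {
        Document doc = new Document();
        doc.add(new Field("host", host,
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("content", content,
                          Field.Store.NO, Field.Index.TOKENIZED,
                          Field.TermVector.YES));
        return doc;
    }
}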
This varies, actually, depending on the document. If
you grab HTML from a portal, and use it all, pages from
that portal will tend to cluster together. If you just
use snippets of text around document passages that
match your query, you can actually get more accurate clustering relative
to your query. It really depends on whether the documents are
single-topic and coherent. If so, use them all; if not,
use snippets. [You can see this problem leading the
Google news classifier astray on occasion.]
A typical way to approximate is to keep only the high-TF/IDF
terms. Principal component methods (e.g. latent semantic
indexing) are also popular for reducing dimensionality,
usually with a least-squares fit criterion.
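The simplest version of that is just truncating each vector to its top k
terms by TF/IDF weight, something like this (untested sketch):

import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: approximate a document vector by keeping only its k
 *  highest-weighted TF/IDF terms, which shrinks the term space
 *  the clusterer has to deal with. */
public class TopTermPruning {

    public static Map<String, Double> topK(Map<String, Double> tfidfVector, int k) {
        Map<String, Double> pruned = new LinkedHashMap<>();
        tfidfVector.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
            .limit(k)
            .forEachOrdered(e -> pruned.put(e.getKey(), e.getValue()));
        return pruned;
    }
}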
A more extreme way to approximate is with signature
files (e.g. to do web-scale "more documents like this"),
but Lucene's not going to help you there. Check out
"Managing Gigabytes" for more on this approach.
- Bob Carpenter
Alias-i