I just woke up thinking it would be cool to try reducing the data
of all documents using PCA (or something similar) and storing the
output in a new field per dimension introduced, in order to find
similar documents by issuing a simple proximity query. Has anyone
attempted something like this?
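
To make the idea concrete, here is a rough sketch in plain Python rather than anything Lucene-specific. Everything in it is an assumption for illustration: the toy term-count vectors, the helper names fit_pca, project and near, and the use of power iteration with deflation to get the principal components. Each reduced coordinate stands in for one "field per dimension", and a per-coordinate range check stands in for the proximity query.

```python
import math


def fit_pca(rows, k, iterations=200):
    """Return (means, components) for the top-k principal components."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    # Sample covariance matrix of the mean-centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    components = []
    for _ in range(k):
        # Power iteration from a fixed, normalized start vector
        # (an assumption: it must not be orthogonal to the component sought).
        v = [float(i + 1) for i in range(dims)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the variance along v.
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(dims) for j in range(dims))
        # Deflate so the next pass converges to the next component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(dims)]
               for i in range(dims)]
        components.append(v)
    return means, components


def project(row, means, components):
    """Map a full vector to its k reduced coordinates (one per 'field')."""
    return [sum((row[j] - means[j]) * c[j] for j in range(len(row)))
            for c in components]


def near(reduced, query_id, radius):
    """Ids of documents whose every reduced coordinate is within radius."""
    q = reduced[query_id]
    return sorted(
        doc_id for doc_id, coords in reduced.items()
        if doc_id != query_id
        and all(abs(a - b) <= radius for a, b in zip(coords, q))
    )


# Toy term-count vectors (invented data): d1/d2 share terms, as do d3/d4.
docs = {
    "d1": [4, 3, 0, 0],
    "d2": [3, 4, 0, 0],
    "d3": [0, 0, 4, 3],
    "d4": [0, 0, 3, 4],
}
means, components = fit_pca(list(docs.values()), k=2)
reduced = {doc_id: project(vec, means, components)
           for doc_id, vec in docs.items()}
print(near(reduced, "d1", 2.0))  # ['d2']
```

In an index, each value in reduced[doc_id] would go into its own numeric field, and the check in near would become one range clause per field around the query document's coordinates.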
I did not
adds extra complexity/cost but
might be an interesting avenue to explore for some apps when selecting
distinguishing characteristics or weighting query results.
Cheers
Mark
----- Original Message -----
From: karl wettin <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 9
> For example, given terms "female", "John" and "London" - all 3 may
> have equal IDF but should a document representing a female in London
> be given equal weighting to a document representing the rarer example
> of a female who happens to be called "John"?
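
A small worked example of the point quoted above, as a sketch only: the six-document corpus is invented, idf and pair_idf are hypothetical helper names, and log(N/df) is the plain textbook IDF rather than Lucene's scoring formula. All three terms get the same IDF, yet the "female" + "John" combination is rarer than "female" + "London".

```python
import math

# Invented corpus: each document is the set of terms it contains.
corpus = [
    {"female", "London"},
    {"female", "London"},
    {"female", "John"},
    {"male", "John"},
    {"male", "John"},
    {"male", "London"},
]


def idf(term):
    """Textbook IDF: log(N / document frequency of the term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)


def pair_idf(a, b):
    """The same measure applied to documents containing both terms."""
    df = sum(1 for doc in corpus if a in doc and b in doc)
    return math.log(len(corpus) / df)


# Each single term appears in 3 of 6 documents, so the IDFs are equal.
print(idf("female"), idf("John"), idf("London"))
# The co-occurrence counts differ: 2 documents versus 1.
print(pair_idf("female", "London"), pair_idf("female", "John"))
```

Scoring on single-term IDF alone cannot tell the two documents apart; only the co-occurrence statistics show that the second combination is the more distinguishing one.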
Not to mention multi-word phrase tokeni