DECAFFMEYER MATHIEU wrote:
The score depends on
1. the query
2. the matched document
3. the index.

I don't really understand why the index must influence the score (why it has been implemented that way).

The score should be the similarity (inverse distance) between the query and the matched document. How similar is the found document to my query? How likely is it that the found document is relevant to my question (query)?

If your query consists of just one word (term), the idf has no influence on the ranking, because it is the same factor for every matched document. If the query consists of multiple terms, it can be useful to weight the terms. The idea is as follows:

1. Indexing view
The task is to find the important words in a document, the keywords that describe it. A term that occurs in just one document identifies that document, so it seems to be very important for that document and is a good keyword candidate. A term that occurs in all documents (like a stop word) cannot describe any one of them, because it does not distinguish that document from the others.

2. Query view
A term that occurs in just one document identifies that document. Your query will return exactly that document, a perfect result, and no ranking is necessary. If you search for a term that occurs in all documents, you retrieve the complete collection. Nothing has been selected, so you are in the same situation as before your query. Such a term is no real help in answering the question, so its weight can be 0 or very small. In general, a term with a small document frequency gets a high weight and a term with a large document frequency gets a low weight.
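
The weighting described in the two views above can be sketched in a few lines of Python. This is only an illustration: the toy collection and the function names are made up, and the textbook form log(N / df) is an assumption, not necessarily the exact formula your search library uses.

```python
import math

# Hypothetical toy collection: each document is a list of already tokenized terms.
docs = {
    "a.htm": ["logistics", "and", "experience"],
    "b.htm": ["logistics", "and", "transport"],
    "c.htm": ["transport", "and", "storage"],
}

def document_frequency(term, docs):
    """Number of documents in the collection that contain the term."""
    return sum(1 for terms in docs.values() if term in terms)

def idf(term, docs):
    """Textbook inverse document frequency, log(N / df) (assumed form)."""
    df = document_frequency(term, docs)
    if df == 0:
        return 0.0  # the term is not in the index at all
    return math.log(len(docs) / df)

# "and" occurs in every document, so its weight is 0.
# "experience" occurs in one document only, so it gets the highest weight.
weights = {t: idf(t, docs) for t in ["and", "logistics", "experience"]}
```

The weights depend on the whole collection, which is why the index influences the score: add or remove documents and the same term gets a different weight.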

A lot of experiments show that score = tf*idf is quite a good ranking method. It is not the best in every case, but it is not bad for the general case. You can use it or not; it depends on your requirements.
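
As a rough sketch of score = tf*idf for a query with several terms, assuming the per-term products are simply summed and idf is log(N / df) (both are simplifying assumptions; a real engine usually adds normalization factors). The documents and the function name rank are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy collection of tokenized documents.
docs = {
    "a.htm": ["fast", "logistics", "and", "delivery"],
    "b.htm": ["logistics", "and", "storage"],
    "c.htm": ["fast", "food", "and", "drink"],
}

def rank(query_terms, docs):
    """Return (name, score) pairs sorted by the sum of tf * idf, best first."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    scored = []
    for name, terms in docs.items():
        tf = Counter(terms)  # term frequency within this one document
        score = sum(
            tf[t] * math.log(n_docs / df[t])
            for t in query_terms
            if df[t] > 0  # skip query terms that are not in the index
        )
        scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank(["logistics", "storage"], docs)
```

Here b.htm comes first because it matches both terms and "storage" is the rarer one, a.htm follows with "logistics" only, and c.htm gets a score of 0.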

Let's say I have this page Logistics.htm
The word "experience" occurs just once in it.
It will get a high score because of the IDF, but the word occurs only once in my document.

Did you really mean the IDF? That looks to me like TF (term frequency): how often a term occurs in a document. The IDF (inverse document frequency) is based on how many documents in the collection contain the term. The idea of tf is that, once the stop words have been removed, a term that occurs quite often in a document is more important for that document than a term that occurs only rarely.
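
To make the difference concrete, here is a small sketch that counts both quantities. Only the file name Logistics.htm comes from your example; the page contents, the other pages and the helper names are invented for illustration, and idf is again assumed to be log(N / df).

```python
import math

# Hypothetical page contents; only the name Logistics.htm is from the question.
pages = {
    "Logistics.htm": "we have experience in logistics".split(),
    "Transport.htm": "transport logistics and more logistics".split(),
    "Contact.htm": "contact us about transport".split(),
}

def tf(term, page):
    """Term frequency: how often the term occurs in ONE document."""
    return pages[page].count(term)

def df(term):
    """Document frequency: how many documents of the collection contain the term."""
    return sum(1 for terms in pages.values() if term in terms)

tf_experience = tf("experience", "Logistics.htm")  # a per-document count
df_experience = df("experience")                   # a per-collection count

# For a one-term query the idf is the same factor for every page,
# so ordering by tf alone and by tf * idf gives the same ranking.
idf_logistics = math.log(len(pages) / df("logistics"))
by_tf = sorted(pages, key=lambda p: tf("logistics", p), reverse=True)
by_tf_idf = sorted(pages, key=lambda p: tf("logistics", p) * idf_logistics, reverse=True)
```

In this toy collection "experience" has a low tf (it occurs once in the page) and a low df (only one page contains it), and the low df is what makes the idf high.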

Sören
