Re: IDFrequency

Soeren Pekrul Fri, 02 Feb 2007 14:23:32 -0800

DECAFFMEYER MATHIEU wrote:

The score depends on
1. the query
2. the matched document
3. the index.
I don't really understand why the index must influence the score (why it has been implemented that way).

The score should be the similarity (inverse distance) between the query and the matched document. How similar is the found document to my query? How likely is it that the found document is relevant to my question (query)?

If your query consists of just one word (term), the idf has no influence on the ranking, because it scales the score of every matching document by the same factor. If the query consists of multiple terms, it can be useful to weight the terms. The idea is as follows:


1. Indexing view

The task is to find the important words in a document, the keywords that describe it. A term that occurs in just one document identifies that document. Such a term seems to be very important for that document and could be a good keyword candidate. If a term occurs in all documents (like stop words), it can't describe a document, because it does not distinguish that document from the others.


2. Query view

A term that occurs in just one document identifies that document. Your query will return exactly that document, a perfect result. No ranking is necessary. If you search for a term that occurs in all documents, you will retrieve the complete collection. You have no selection, no subcollection; you are in the same situation as before your query. Such a term is no real help in finding an answer to a question, so its weight could be 0 or very small. In short, a term with a small document frequency gets a high weight, and a term with a large document frequency gets a lower weight.
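To make the weighting concrete, here is a minimal sketch using the textbook formula idf = log(N / df). The toy collection, the function names, and the choice to give unseen terms a weight of 0 are my assumptions for illustration; a real search library may use a smoothed variant of the formula.

```python
import math

def document_frequency(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

def idf(term, docs):
    """Textbook inverse document frequency: log(N / df)."""
    df = document_frequency(term, docs)
    if df == 0:
        return 0.0  # assumption: a term that occurs nowhere gets no weight
    return math.log(len(docs) / df)

# Hypothetical collection; each document is reduced to its set of terms.
docs = [
    {"the", "logistics", "experience"},
    {"the", "warehouse", "transport"},
    {"the", "transport", "schedule"},
    {"the", "invoice"},
]

idf_the = idf("the", docs)                # occurs in all 4 documents
idf_transport = idf("transport", docs)    # occurs in 2 documents
idf_experience = idf("experience", docs)  # occurs in 1 document
```

With this formula the term that occurs in every document gets weight 0, and the weight grows as the document frequency shrinks.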

Many experiments show that score = tf * idf is a quite good ranking method. It is not the best in all cases, but it is not bad for the general case. You can use it or not; it depends on your requirements.
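As a rough illustration of score = tf * idf for a query with several terms, the sketch below sums tf * idf over the query terms. The summation, the file names, and the sample texts are assumptions made for this example, not the exact scoring of any particular library.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, all_docs):
    """Sum of tf * idf over the query terms, with idf = log(N / df)."""
    counts = Counter(doc_tokens)
    n = len(all_docs)
    score = 0.0
    for term in query_terms:
        df = sum(1 for tokens in all_docs if term in tokens)
        if df == 0:
            continue  # term occurs nowhere in the collection
        score += counts[term] * math.log(n / df)
    return score

# Hypothetical pages, tokenized by whitespace.
docs = {
    "a.htm": "the logistics team has experience with transport".split(),
    "b.htm": "the transport of the goods and the transport schedule".split(),
    "c.htm": "the invoice for the goods".split(),
}

query = ["the", "experience"]
all_docs = list(docs.values())
scores = {name: tf_idf_score(query, tokens, all_docs)
          for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here "the" occurs in every page, so it contributes nothing to any score, even for b.htm where it occurs three times. Only "experience" separates the pages, which is the effect the idf weight is meant to have.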

Let's say I have this page Logistics.htm
I have the word "experience" just once in it.
It will get a high score because of the IDF, but the word occurs only once in my document.

Did you really mean the IDF? That looks to me like TF (term frequency): how often a term occurs in a document. The IDF (inverse document frequency) is based on how many documents in the collection contain the term. The idea of tf is that, once you have removed the stop words, a term that occurs quite often in a document is more important for that document than a term that occurs quite rarely.
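The difference between the two counts can be sketched as follows. Only the name Logistics.htm comes from the question; the other pages and all page texts are invented for illustration.

```python
from collections import Counter

# Hypothetical pages, tokenized by whitespace.
pages = {
    "Logistics.htm": "our logistics staff has long experience in transport planning".split(),
    "Contact.htm": "contact our staff by phone or mail".split(),
    "Transport.htm": "transport transport routes and transport prices".split(),
}

def term_frequency(term, page):
    """tf: how often the term occurs inside one page."""
    return Counter(pages[page])[term]

def doc_frequency(term):
    """df: how many pages of the collection contain the term."""
    return sum(1 for tokens in pages.values() if term in tokens)

tf_experience = term_frequency("experience", "Logistics.htm")
df_experience = doc_frequency("experience")
tf_transport = term_frequency("transport", "Transport.htm")
df_transport = doc_frequency("transport")
```

tf is counted per page, while df is counted once for the whole collection. A word can therefore have a low tf in a page and still get a high idf, as long as few other pages contain it.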


Sören
