OK, this is a bit late, I had a pile of things that were due today. 
Fortunately, I'm almost done with my graduate classes (Friday), which 
will be quite nice.

Scoring in version 3.2:

In versions 3.1 and before, scoring was done at the time of indexing. 
This made scoring during the search quite easy (it was mostly 
pre-computed), but it made it a real hassle to optimize the default 
scoring factors. Since the defaults are by no means the best possible 
values for everyone, this essentially prevented experimentation.

As outlined in previous overviews, the words themselves in 3.2 are 
stored with a set of "flags" representing the context. The flags are 
associated with various factors, and currently htsearch loops through 
and sums up the factors for each matching word in a document. 
Note that unlike versions before 3.2, the position in the document 
doesn't play a part in scoring. (Previous versions scaled the 
character position from 1-1000 and gave a factor of 1000 to appearing 
in the beginning and decreasing down to a factor of 1 to appearing at 
the end.)
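For concreteness, here's a little sketch (in Python, with made-up flag 
names and factor values) of how summing per-occurrence factors from the 
context flags might look:

```python
# Illustrative only: the flag names and factor values below are
# hypothetical, not the actual 3.2 defaults.
FLAG_FACTORS = {
    "text": 1.0,
    "heading": 5.0,
    "keywords": 10.0,
    "title": 100.0,
}

def word_weight(occurrences):
    """Sum the factor for each flagged occurrence of a word in a document.

    `occurrences` is a list of flag names, one per occurrence of the
    word in the document.
    """
    return sum(FLAG_FACTORS[flag] for flag in occurrences)

# A word appearing twice in plain text and once in a heading:
print(word_weight(["text", "text", "heading"]))  # 7.0
```

Note that, as described above, the character position of each 
occurrence never enters into the sum.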

So let's run through the scoring for two words, foo and foobar. Let's 
say for the sake of argument that foobar was generated by a fuzzy 
algorithm and has a search_algorithm weighting of 0.5.

Now in document A, "foo" occurs 10 times, with total weight 350 and 
"foobar" occurs 5 times, with total weight 200 (e.g. they all appear 
as headers). Let's also say in the total database, "foo" occurs 250 
times and "foobar" occurs 100 times.

Without referring to a formula, we know that we have to balance the 
number of occurrences in the document against how common the word is 
overall. Currently, it's difficult to work out the number of 
occurrences of a word within a single document. However, it's easy to 
work out the total number of occurrences of a word across the database.

So for document A, the score from the words goes roughly like this:
Sum(Fuzzy_Factor * Word_Weight / Total_Word_Frequency)

word_score = 1*350/250 + 0.5*200/100 = 1.4 + 1 = 2.4
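In code, that sum might be sketched like this (the helper is 
hypothetical; the numbers are the ones from the example above):

```python
def word_score(matches):
    """Sum Fuzzy_Factor * Word_Weight / Total_Word_Frequency per word.

    `matches` is a list of (fuzzy_factor, weight_in_doc, db_frequency)
    tuples, one per matching word.
    """
    return sum(fuzzy * weight / freq for fuzzy, weight, freq in matches)

# Document A: "foo" is an exact match (factor 1.0) with weight 350 and
# 250 total occurrences in the database; "foobar" came from a fuzzy
# algorithm (factor 0.5) with weight 200 and 100 total occurrences.
score = word_score([(1.0, 350, 250), (0.5, 200, 100)])
print(score)  # 2.4
```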

Currently, there are two non-word factors: backlink_factor and 
date_factor. Another reasonable one would be hopcount_factor, and of 
course Hans-Peter's url_seed_score modifications would fit in here as 
well. These simply add to the document weighting based on other 
attributes of the document.
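A rough sketch of how these additive non-word factors might combine 
(backlink_factor and date_factor are the real attribute names, but the 
inputs and scaling here are made up for illustration):

```python
def document_score(word_score, backlink_count, recency,
                   backlink_factor=0.5, date_factor=0.25):
    """Add non-word factors to the summed word score.

    `backlink_count` and `recency` are stand-ins for whatever document
    attributes the real factors are applied to; the point is just that
    each factor contributes an additive term.
    """
    return (word_score
            + backlink_factor * backlink_count
            + date_factor * recency)

# Word score 2.4 from the example, plus hypothetical document attributes:
print(document_score(2.4, backlink_count=3, recency=1.0))
```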

In the current code, before reporting the score (and sorting), 
htsearch takes the natural log of this value. Why? This is an attempt 
to even things out a bit: you need an order of magnitude more weight 
to gain a fixed increment in score. This doesn't entirely balance out 
the extra weight given to long documents, but it helps.
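A quick illustration of what the log buys: every order of magnitude of 
raw weight adds the same constant (ln 10, about 2.303) to the final 
score, so a tenfold-heavier document doesn't score ten times higher.

```python
import math

def final_score(raw_score):
    """Compress the raw document score with a natural log.

    Assumes raw_score > 0 (a document with no matches wouldn't be
    reported at all).
    """
    return math.log(raw_score)

# Going from weight 1 to 10 adds the same amount as going from 10 to 100:
print(round(final_score(10.0) - final_score(1.0), 3))    # 2.303
print(round(final_score(100.0) - final_score(10.0), 3))  # 2.303
```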

There are, of course, many variations on this theme. Almost any IR 
book will describe a few variants. However, this improves on previous 
scoring mechanisms by taking total word frequency into account and 
attempting to balance out long documents. Testing would be helpful to 
see if it actually works!

-Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] 