OK, this is a bit late; I had a pile of things that were due today. Fortunately, I'm almost done with my graduate classes (Friday), which will be quite nice.

Scoring in version 3.2:

In versions 3.1 and before, scoring was done at the time of indexing. This made scoring during the search quite easy (it was mostly pre-computed), but it is a real hassle if you're trying to optimize the default scoring factors. Since the defaults are by no means the best possible values for everyone, this essentially prevents experimentation.

As outlined in previous overviews, in 3.2 the words themselves are stored with a set of "flags" representing their context. The flags are associated with various factors, and currently htsearch loops through and sums up the factors for each matching word in a document. Note that unlike versions before 3.2, the position in the document doesn't play a part in scoring. (Previous versions scaled the character position from 1-1000 and gave a factor of 1000 for appearing at the beginning, decreasing down to a factor of 1 for appearing at the end.)

So let's run through the scoring for two words, foo and foobar. Let's say for the sake of argument that foobar was generated by a fuzzy algorithm and has a search_algorithm weighting of 0.5. Now in document A, "foo" occurs 10 times with a total weight of 350, and "foobar" occurs 5 times with a total weight of 200 (e.g. they all appear as headers). Let's also say that in the whole database, "foo" occurs 250 times and "foobar" occurs 100 times.

Without referring to a formula, we know that we have to balance the number of occurrences in the document against how common the word is. Currently, it's difficult to work out the number of occurrences in a document; however, it's easy to work out the total number of occurrences of a word.
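To make the flag idea concrete, here is a minimal Python sketch of how per-context factors might be summed for one word occurrence. The flag names and factor values below are invented for illustration; they are not htdig's actual identifiers or defaults.

```python
# Hypothetical context flags, loosely modeled on the 3.2 word-flag idea.
FLAG_PLAIN  = 1 << 0
FLAG_TITLE  = 1 << 1
FLAG_HEADER = 1 << 2

# Hypothetical per-context factors.  Because these are applied at search
# time rather than at indexing time, they can be tuned without re-indexing.
FACTORS = {
    FLAG_PLAIN:  1.0,
    FLAG_TITLE:  10.0,
    FLAG_HEADER: 5.0,
}

def word_weight(flags):
    """Sum the factors for every context flag set on a word occurrence."""
    return sum(f for flag, f in FACTORS.items() if flags & flag)

# An occurrence marked as both plain text and a header:
print(word_weight(FLAG_PLAIN | FLAG_HEADER))  # 6.0
```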
So for document A, the score from the words goes like this:

  Sum(Fuzzy_Factor * Word_Weight / Total_Word_Frequency)

  word_score = 1*350/250 + 0.5*200/100 = 1.4 + 1 = 2.4

Currently, there are two non-word factors: backlink_factor and date_factor. Another reasonable one would be hopcount_factor, and of course Hans-Peter's url_seed_score modifications would fit in here as well. These simply add to the document weighting based on other attributes of the document.

In the current code, before reporting the score (and sorting), htsearch takes the natural log of this value. Why? This is an attempt to even things out a bit--a document has to have an order of magnitude more weight to gain a fixed increment in score. This doesn't entirely balance out the extra weight given to long documents, but it helps.

There are, of course, many variations on this theme. Almost any IR book will describe a few variants. However, this improves on previous scoring mechanisms by taking total word frequency into account and attempting to balance out long documents. Testing would be helpful to see if it actually works!

-Geoff

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to [EMAIL PROTECTED]
You will receive a message to confirm this.
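The whole scoring walk-through above can be condensed into a short Python sketch. This is not htsearch's actual C++ code; the function name, the tuple shapes, and the stand-in backlink/date arguments are made up for illustration.

```python
import math

def document_score(matches, total_freq, backlink_score=0.0, date_score=0.0):
    """Sketch of the 3.2 scoring described in the post.

    matches:    list of (word, fuzzy_factor, total_weight_in_doc) tuples
    total_freq: dict mapping word -> occurrences across the whole database
    The extra arguments stand in for the backlink_factor and date_factor
    contributions, which simply add to the document weighting.
    """
    word_score = sum(fuzzy * weight / total_freq[word]
                     for word, fuzzy, weight in matches)
    raw = word_score + backlink_score + date_score
    # htsearch takes the natural log before sorting, so a document needs
    # roughly an order of magnitude more raw weight per extra unit of score.
    return math.log(raw)

# Document A from the example: foo (exact match, weight 350, 250 in the
# database) and foobar (fuzzy at 0.5, weight 200, 100 in the database).
score = document_score(
    [("foo", 1.0, 350), ("foobar", 0.5, 200)],
    {"foo": 250, "foobar": 100},
)
print(round(score, 3))  # ln(2.4) = 0.875
```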
