Hi!
I find the results obtained by HTDig with a non-zero backlink_factor very
obscure and even misleading. I would like to propose a different algorithm
for calculating relevance.
Suppose someone is looking for a solution to a problem. At first s/he needs
to know the different approaches to the problem, so Web pages that offer
a _choice_ (i.e. pages with more outgoing links) appear more important
than the others. On the other hand, pages with more incoming links
(links from other pages) are more important too. This is why I propose
to calculate the backlink_score of a document according to the formula:
backlink_score = 1 + backlink_factor *
(number_of_incoming_links + number_of_outgoing_links);
score = score * backlink_score;
with a backlink_factor of about 0.04 (a page with 25 links in total then
gets a backlink_score of 2, which doubles its score). Note that when
calculating the score I use multiplication instead of addition.
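
A minimal sketch of the formula above as standalone C++, for anyone who
wants to try the numbers outside of HTDig. The function names
backlinkScore and applyBacklinkScore are hypothetical and do not exist in
HTDig; the link counts are assumed to be plain non-negative integers.

```cpp
#include <cassert>

// Hypothetical helper (not part of HTDig): multiplicative link boost.
// Returns 1.0 for a page with no links, so the score is left unchanged.
double backlinkScore(double backlinkFactor, int incomingLinks, int outgoingLinks)
{
    assert(incomingLinks >= 0 && outgoingLinks >= 0);
    return 1.0 + backlinkFactor * (incomingLinks + outgoingLinks);
}

// Hypothetical helper: apply the boost by multiplication, as proposed.
double applyBacklinkScore(double score, double backlinkFactor,
                          int incomingLinks, int outgoingLinks)
{
    return score * backlinkScore(backlinkFactor, incomingLinks, outgoingLinks);
}
```

With backlinkFactor = 0.04, a page with 10 incoming and 15 outgoing links
gets a backlink_score of about 2, so a score of 50 becomes about 100.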
Another point. I also propose to correct the algorithm for the date_score
calculation, because the current algorithm even gave me _negative_ values of
the date_score for some of my documents (!?). In my opinion, exponential
decay with a characteristic decay time of about 3 years is a reasonable
function to describe how a Web page goes out of date (it corresponds to a
roughly fivefold decrease of date_score for a 5-year-old document). Note
again that for the calculation of the score I use multiplication instead of
addition. (I think calculating an exponential takes no more than
1 microsecond on modern processors, so it's quite fast.)
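
A small sketch comparing the two date formulas, assuming that both the
document time and "now" are seconds since the Unix epoch. The function
names oldDateScore and newDateScore are hypothetical; the old formula and
the seconds-per-year constant are copied from the patch. The old formula
turns negative as soon as the document time falls below 90% of "now",
which in late 2002 means any document older than about 3.3 years.

```cpp
#include <cassert>
#include <cmath>

// Number of seconds in a year, same constant as in the patch.
const double kSecondsPerYear = 31556925.97;

// Old formula: negative whenever docTime / now < 0.9.
double oldDateScore(double dateFactor, double docTime, double now)
{
    return dateFactor * ((docTime * 1000.0 / now) - 900.0);
}

// Proposed formula: exponential decay with document age.
// Returns 1.0 for a brand-new document and stays positive for any age.
double newDateScore(double dateFactor, double docTime, double now)
{
    assert(now > 0.0);
    return std::exp(-dateFactor * (now - docTime) / kSecondsPerYear);
}
```

For a document 5 years older than a "now" in mid-November 2002, the old
formula with date_factor = 1 gives about -52, while the new one with
date_factor = 0.3 gives exp(-1.5), about 0.22.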
The proposed algorithms are probably not perfect, but if you test them,
you'll find the results more relevant than with the previous ones (at least
from the end user's point of view).
- Alexander
--------------------------
Here is a simple patch to the Display.cc file (version 3.2.0b4-20021110):
--- Display.cc.orig Sat Jul 27 03:48:19 2002
+++ Display.cc Thu Nov 14 21:43:19 2002
@@ -1420,27 +1420,27 @@
// Other changes to the score can happen now
// Or be calculated by the result match in getScore()
- // This formula derived through experimentation
- // We want older docs to have smaller values and the
- // ultimate values to be a reasonable size (max about 100)
-
base_score = score;
+ const double year = 31556925.97; // number of seconds in a year
if (date_factor != 0.0)
{
- date_score = date_factor *
- ((thisRef->DocTime() * 1000.0 / (double)now) - 900);
- score += date_score;
+// AIL: exponential decay with time: older docs have smaller date_score.
+// date_factor=0.3 results in a roughly fivefold decrease of the date_score
+// for a 5-year-old document.
+ date_score = exp(- date_factor *
+ (double)(now - (thisRef->DocTime())) / year);
+ score *= date_score;
}
if (backlink_factor != 0.0)
{
int links = thisRef->DocLinks();
- if (links == 0)
- links = 1; // It's a hack, but it helps...
-
- backlink_score = backlink_factor
- * (thisRef->DocBackLinks() / (double)links);
- score += backlink_score;
+// AIL: new strategy: more links -- more informative page
+// backlink_factor=0.04 results in a twofold increase of the backlink_score
+// for a document with 25 links.
+ backlink_score = (1.0 + backlink_factor *
+ (thisRef->DocBackLinks() + (double)links));
+ score *= backlink_score;
}
if (debug) {