hi all,
 
I discovered a few problems with the score calculation and with multiple
config files in htsearch/Display.cc in ht://dig, version htdig-3.2.0b3,
CVS snapshot 080600.

1. In Display::buildMatchList, the member variables minScore and
maxScore are set to the minimum and maximum of the calculated scores,
while the score value used later in Display::generateStars is set to
1 + log(score).
 
2. The member variables minScore and maxScore of the class Display have
the type double; ResultMatch.score has the type float -- but
DocumentRef.docScore is of type int. Now, Display::buildMatchList
calculates the score value for each match as type double, and stores the
value as a float in a ResultMatch instance; but Display::generateStars
uses the int value DocumentRef::DocScore to calculate the number of
stars to be displayed.
 
Thus, at least the minScore-match will in most cases have a negative
relative score value in Display::generateStars: (int(minScore) -
minScore) / (maxScore - minScore) is in most cases a negative number.
 
Especially the "1+log(score) inconsistency" can lead to a huge number of
"<img src=...>" strings in the output. In a test, I got HTML up to 9 MB.
This can also happen without the log inconsistency, if minScore and
maxScore do not differ very much.
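
To make the size of the effect concrete, here is a minimal sketch of the arithmetic. It is not ht://dig code: starCount and storedScore are hypothetical free functions that only repeat the two formulas from Display::generateStars and the 1 + log(score) step, and it assumes the stored value is truncated to int on its way through DocumentRef::DocScore().

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: the relative-score and star-count formulas as they
// appear in Display::generateStars, with the bounds passed in explicitly.
int starCount(double docScore, double minScore, double maxScore, int maxStars)
{
    assert(maxStars >= 1 && maxScore != minScore);
    double score = (docScore - minScore) / (maxScore - minScore);
    return int(score * (maxStars - 1) + 0.5) + 1;
}

// Assumption: before the patch the per-document value is 1 + log(raw score),
// truncated to int because DocumentRef::docScore is an int.
int storedScore(double rawScore)
{
    return int(1.0 + std::log(rawScore));
}
```

Calling starCount(storedScore(11000.090909), 11000.090909, 11000.636364, 4), that is, with the minScore/maxScore pair quoted later in this mail and an assumed maximum of 4 stars, gives a negative count whose magnitude is in the tens of thousands instead of a value between 1 and 4. How that count turns into "<img src=...>" strings depends on the drawing loops in generateStars, which are not reproduced here.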

3. If minScore == maxScore, the statement in line 864 in Display.cc:
 
    score = (ref->DocScore() - minScore) / (maxScore - minScore);
 
should result in a divide-by-zero error. (Well, I didn't check; but I
also did not find anything like "if (maxScore == minScore) ...".)
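
For reference, a guard of the kind that seems to be missing could look like the sketch below. relativeScore is a hypothetical helper, not an existing ht://dig function, and the -1.0 sentinel is an arbitrary choice for illustration. Note that with double operands on an IEEE 754 platform the division itself usually does not trap; it yields infinity or NaN, and it is the later conversion of that value to int that is undefined.

```cpp
#include <cassert>

// Hypothetical helper: relative position of docScore between minScore and
// maxScore, or -1.0 when the two bounds coincide and no ranking exists.
double relativeScore(double docScore, double minScore, double maxScore)
{
    assert(minScore <= maxScore);
    double diff = maxScore - minScore;
    if (diff == 0.0)
        return -1.0;            // caller should skip the star display
    return (docScore - minScore) / diff;
}
```

A caller would test for a negative return value before computing the number of stars.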
 
A comment on my patch: I don't have much experience with the "scoring
system" of ht://dig (or with ht://dig in general), but I don't think it
is a good idea to give relative ranking information for a result set
with minScore = 11000.090909 and maxScore = 11000.636364 (a real
example :)
 
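Therefore, with the attached patch, Display::generateStars does not
return any stars if minScore and maxScore are too close relative to
their size: stars are shown only if (maxScore + minScore) / (maxScore -
minScore) is less than MINSCOREDIFF. The chosen threshold MINSCOREDIFF
probably needs to be adjusted.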
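
The test used in the patch can be tried in isolation with the sketch below. showStars is a hypothetical wrapper around the patch's condition; the constant carries the same value as the MINSCOREDIFF define in the diff.

```cpp
#include <cassert>

// Same value as "#define MINSCOREDIFF 10.0" in the attached patch.
const double MINSCOREDIFF = 10.0;

// Hypothetical wrapper around the condition added to Display.cc: stars are
// shown only when the spread is non-zero and not tiny compared to the sum.
bool showStars(double minScore, double maxScore)
{
    assert(minScore <= maxScore);
    double diff = maxScore - minScore;
    return diff != 0.0 && (maxScore + minScore) / diff < MINSCOREDIFF;
}
```

For the real example above, the spread is about 0.55 while the sum is about 22000, so the ratio is far above 10 and showStars(11000.090909, 11000.636364) returns false. A result set with scores between 1.0 and 9.0 has a ratio of 1.25 and still gets stars.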
 
4. In Display.cc, line 680, the "config=some_file_name" (as parsed from
the program input) is appended to the URL that is used to select
different result pages, but a few lines later, "config=..." is added
again, this time from collectionList. Thus, we get the config parameter
twice in the URL.
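
The intended behaviour, with the collection list as the only source of "config=" parameters, can be sketched as follows. This is a simplified illustration using std::string; configParams is a hypothetical function, and the URL encoding that Display.cc does through encodeInput is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: emit one "config=<name>;" per selected collection.
// Nothing is added from the raw program input, so a collection that is
// listed once appears once in the URL.
std::string configParams(const std::vector<std::string>& collections)
{
    std::string url;
    for (std::size_t i = 0; i < collections.size(); ++i) {
        assert(!collections[i].empty());   // names are expected to be non-empty
        url += "config=" + collections[i] + ';';
    }
    return url;
}
```

With a single collection named realm1 this produces "config=realm1;" exactly once, which is what the page-selection URL should contain.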
                                                                                       
                                                        
5. If two config files are specified (for searches in two databases, or
accidentally through the bug mentioned above), and if the second config
file contains an "include" statement, the lines following this include
statement are not used.
 
This means that the following layout for config files is at present not
usable if two databases are to be searched:
 
        # read common parameters:
        include:        htdig.conf
        database_base:  realm1
        start_url:      http://my.server/realm1
 
htsearch tries to use the default database instead. For tests, this
behaviour should be easy to reproduce if there are no database files
with the default name: htsearch will then print the error message
"Unable to read word database file '/opt/www/var/htdig/db.words.db'".
 
The attached diff output should fix problems 1 to 4; I don't know what
to do with the "multiple config/include" problem. My guess is that
something is wrong with the yacc files, but I don't have enough
experience with yacc/bison/[f]lex to look deeper into that myself...
(BTW, I'm using Suse Linux 6.3, egcs-2.91.66, GNU Bison version 1.25,
flex version 2.5.4)
 
Abel
--- htcommon/DocumentRef.h.orig Tue Aug 15 00:46:08 2000
+++ htcommon/DocumentRef.h      Thu Aug 17 00:01:37 2000
@@ -64,7 +64,7 @@
     ReferenceState     DocState()                      {return docState;}
     int                        DocSize()                       {return docSize;}
     List               *DocAnchors()                   {return &docAnchors;}
-    int                        DocScore()                      {return docScore;}
+    double             DocScore()                      {return docScore;}
     int                 DocSig()                        {return docSig;}
     int                        DocAnchor()                     {return docAnchor;}
     int                        DocHopCount()                   {return docHopCount;}
@@ -89,7 +89,7 @@
     void                DocSig(int s)                   {docSig = s;}
     void               DocAnchors(List &l)             {docAnchors = l;}
     void               AddAnchor(const char *a);
-    void               DocScore(int s)                 {docScore = s;}
+    void               DocScore(double s)              {docScore = s;}
     void               DocAnchor(int a)                {docAnchor = a;}
     void               DocHopCount(int h)              {docHopCount = h;}
     void               DocEmail(const char *e)         {docEmail = e;}
@@ -156,7 +156,7 @@
     //
     
     // This is the current score of this document.
-    int                        docScore;
+    double             docScore;
     // This is the nearest anchor for the search word.
     int                        docAnchor;
 
--- htsearch/Display.cc.orig    Mon Aug 14 22:45:06 2000
+++ htsearch/Display.cc Thu Aug 17 00:35:25 2000
@@ -41,6 +41,8 @@
 # define DBL_MAX MAXFLOAT
 #endif
 
+#define MINSCOREDIFF 10.0
+
 //*****************************************************************************
 //
 Display::Display(Dictionary *collections)
@@ -324,7 +326,9 @@
     vars.Add("SIZEK", new String(form("%d",
                                          (ref->DocSize() + 1023) / 1024)));
 
-    if (maxScore != 0)
+    double diff = maxScore - minScore;
+    
+    if (diff && (maxScore + minScore) / diff < MINSCOREDIFF)
       {
        int percent = (int)((ref->DocScore() - minScore) * 100 /
                            (maxScore - minScore));
@@ -672,8 +676,9 @@
        url << "restrict=" << encodeInput("restrict") << ';';
     if (input->exists("exclude"))
        url << "exclude=" << encodeInput("exclude") << ';';
-    if (input->exists("config"))
-       url << "config=" << encodeInput("config") << ';';
+    // Not needed: The next loop below handles this output
+    //if (input->exists("config"))
+    // url << "config=" << encodeInput("config") << ';';
 
     // Put out all specified collections. If none selected, resort to
     // default behaviour.
@@ -858,17 +863,20 @@
 
     String image = config["star_image"];
     const String blank = config["star_blank"];
-    double     score;
+    double     score, diff;
 
-    if (maxScore != 0)
+    
+    diff = maxScore - minScore;
+    
+    if (diff && (maxScore + minScore) / diff < MINSCOREDIFF)
     {
-       score = (ref->DocScore() - minScore) / (maxScore - minScore);
+       score = (ref->DocScore() - minScore) / diff;
     }
     else
     {
-       maxScore = ref->DocScore();
-       score = 1;
+        return result;
     }
+
     int                nStars = int(score * (maxStars - 1) + 0.5) + 1;
 
     if (right)
@@ -1144,7 +1152,8 @@
        // Get rid of it to free the memory!
        delete thisRef;
 
-       thisMatch->setScore(1.0 + log(score));
+       score = 1.0 + log(score);
+       thisMatch->setScore(score);
        thisMatch->setAnchor(dm->anchor);
                
        //
