According to Joe R. Jah:
> I downloaded and installed it on a BSDI 4.0 box; it compiled but, htsearch
> dumped core.  I followed the old BSDI/htdig fix:
...
> everything worked except the old local duplicate suppressor patch:
> ftp://sol.ccsf.cc.ca.us/htdig-patches/3.0.8b2/Retriever.cc.0
> did not quite do its job.
...
> As you see database sizes do not vary too much, but the results pages
> point to the same URL MULTIPLE times in 3.1.4 case; baffling;-/?

I tried to apply this patch to the 3.1.4 prerelease just now, and it failed
entirely.  Did you apply it manually?  Did you change the IsLocal() call in
the patch to GetLocal() instead, as is needed by the new Retriever code?
Did you run htdig with -v to see whether the patched-in duplicate
suppression code was actually being activated?

Different database sizes could be due to the fact that 3.1.4 indexes img
alt text, and doesn't clobber words immediately following bare ampersands.
I can't imagine why you'd see the exact same URL multiple times, but it
may be that in manually applying the patch to Need2Get, you broke the
function.

Here's a 3.1.4 adaptation of this old patch, completely untested of course,
but if you want to give it a shot, please do.  If the old code worked, I
can see no reason why this patch wouldn't.  I removed the "return TRUE;"
after the visited.Add() call, which would have caused a memory leak as it
didn't delete the local_filename at that point.


[Adapted from patch by Warren Jones]
This patch to ht://Dig allows it to reject URLs on a local host that
are links (through the file system) to a URL that has already been
indexed.  This works with the local_urls option in version 3.1.4.
I didn't bother to create another hashtable, but just added a key
based on the file's device and inode numbers to Retriever::visited.
The following patch was made against version 3.1.4.

--- htdig/Retriever.cc.orig     Fri Dec  3 11:58:38 1999
+++ htdig/Retriever.cc  Tue Dec  7 15:41:42 1999
@@ -18,6 +18,8 @@
 #include <signal.h>
 #include <assert.h>
 #include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
 #include "HtWordType.h"
 
 static WordList        words;
@@ -603,7 +605,36 @@ Retriever::Need2Get(char *u)
     static String      url;
     url = u;
 
-    return !visited.Exists(url);
+    if ( visited.Exists(url) )
+       return FALSE;
+       
+    String *local_filename = GetLocal(u);   // For local URL's, check
+    if ( local_filename )                  // list for device and inode
+    {                                      // to make sure we haven't
+       struct stat buf;                    // already indexed a link
+                                           // to this file.
+
+       if ( stat(local_filename->get(),&buf) == 0 )
+       {
+           char key[2*sizeof(dev_t)+2*sizeof(ino_t)+2];   // Make hash key
+           sprintf( key, "%lx+%lx", (unsigned long)buf.st_dev, (unsigned long)buf.st_ino );  // from device and inode.
+           if ( visited.Exists(key) )                        // and inode.
+           {
+               if ( debug ) {
+                   String *dup = (String*)visited.Find(key);
+                   cout << endl
+                        << "Duplicate: " << local_filename->get()
+                        << " -> "        << dup->get() << endl;
+               }
+               delete local_filename;
+               return FALSE;
+           }
+           visited.Add(key, new String(*local_filename));  // store a copy; ours is deleted below
+       }
+       delete local_filename;
+    }
+    return TRUE;
+
 }
 
 


-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
