Last week, I wrote:
> According to David Adams:
> > I am also using ht://Dig version 3.1.6 and for me it IS indexing URLs like
> > 
> > http://www.soton.ac.uk/~lopsoc/gallery.php?gallery=sorcerer1&photo=CNV00023.jpg
> > 
> > even though I have .jpg in my bad_extensions: list.
> 
> Actually, I find this surprising.  Upon looking at the code that handles
> bad_extensions, in both 3.1.6 and 3.2.0b5, it seems to me that there is
> indeed a bug in the way htdig locates filename extensions in URLs, as
> Toby described.  Can you confirm that you're running vanilla 3.1.6 with
> no patches to htdig/Retriever.cc which might correct this bug?
> 
> The fix to the code should be pretty simple, but I haven't had the time
> to sit down and stare at it long enough to get the fix coded yet.  I'll
> try to get around to it by Friday, so it'll be in the next development
> snapshot for the 3.2 betas, and posted to the list.

OK, last week got a bit crazy, so I wrote the patch yesterday afternoon,
just before the end of my work day.  Here it is.  Apply it in your main
3.1.6 source directory using "patch -p0 < this-message-file".  Please
let me know whether it solves the problem for you, or causes any new
ones.  I've made sure the code compiles with the patch, but haven't
tested it beyond that.  Thanks.

--- htdig/Retriever.cc.orig     2002-01-25 07:44:49.000000000 -0600
+++ htdig/Retriever.cc  2004-03-29 17:40:07.000000000 -0600
@@ -711,16 +711,17 @@ Retriever::IsValidURL(char *u)
     //
     // See if the path extension is in the list of invalid ones
     //
-    char       *ext = strrchr(url, '.');
+    String     urlpath = url.get();
+    int parm = urlpath.indexOf('?');   // chop off URL parameter
+    if (parm >= 0)
+       urlpath.chop(urlpath.length() - parm);
+    char       *ext = strrchr(urlpath, '.');
     String     lowerext;
     if (ext && strchr(ext, '/'))       // Ignore a dot if it's not in the
       ext = NULL;                      // final component of the path.
     if (ext)
       {
        lowerext = ext;
-       int parm = lowerext.indexOf('?');       // chop off URL parameter
-       if (parm >= 0)
-           lowerext.chop(lowerext.length() - parm);
        lowerext.lowercase();
        if (invalids->Exists(lowerext))
          {
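
For anyone who wants to see the intended behaviour outside of htdig, here
is a rough standalone sketch of the same idea: chop the query string off
first, then look for the last dot, and ignore it if it isn't in the final
path component.  It uses std::string rather than ht://Dig's own String
class, and pathExtension is a made-up helper name for illustration only,
not something that exists in Retriever.cc.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper, not part of ht://Dig: returns the lowercased
// extension (including the dot) of the final path component of a URL,
// or an empty string if that component has no extension.
std::string pathExtension(const std::string &url)
{
    // Chop off the URL parameters first, so dots in them are ignored.
    // If there is no '?', find() returns npos and the whole URL is kept.
    std::string path = url.substr(0, url.find('?'));

    // Find the last dot in what remains.
    std::string::size_type dot = path.rfind('.');
    if (dot == std::string::npos)
        return "";

    // Ignore a dot if it's not in the final component of the path.
    if (path.find('/', dot) != std::string::npos)
        return "";

    // Lowercase the extension so it can be compared case-insensitively.
    std::string ext = path.substr(dot);
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ext;
}
```

With this logic, the gallery.php URL quoted above yields ".php" rather
than ".jpg", since the dot in the photo= parameter is no longer seen.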

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

