I was very happy to find that the "valid_extensions" option was
added in version 3.1.4 -- something like this is essential
given the rather chaotic nature of the web server that I have
to index.  But I found that a couple of changes were necessary
to make valid_extensions work the way I wanted it to.

If "valid_extensions" is defined, I'd like to retrieve URLs
without extensions *if_and_only_if* they represent a directory.
However, I found that every URL without an extension is rejected
if the URL contains a fully qualified domain name, e.g.:

     http://www.foo.com/bar/

Retriever::IsValidURL() rejects this URL because it takes
everything from the last dot onward as the extension:

     .com/bar/

The patch for Retriever.cc (included below) fixes this.
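
To illustrate the check outside of ht://Dig, here is a minimal
standalone sketch.  The helper name extension_of() is made up for
this example and is not part of Retriever.cc; it only mirrors the
strrchr/strchr test in the patch.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper (not an ht://Dig function): return a pointer to
// the extension of the final path component, including the leading
// dot, or nullptr if the last dot in the URL is followed by a slash.
static const char *extension_of(const char *url)
{
    assert(url != nullptr);

    // Find the last dot anywhere in the URL.
    const char *ext = std::strrchr(url, '.');

    // A slash after that dot means the dot belongs to the host name
    // or to an earlier directory, so it is not an extension.
    if (ext && std::strchr(ext, '/'))
        ext = nullptr;

    return ext;
}
```

With this test, "http://www.foo.com/bar/" has no extension, while
"http://www.foo.com/bar/index.html" yields ".html".  A URL with no
path at all, such as "http://www.foo.com", would still report ".com",
since no slash follows the last dot.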

To ensure that a URL without an extension will be retrieved
only if it's a directory, I modified URL::normalize() so that
a slash is appended to any URL that doesn't have an extension.
This guarantees that retrieval will fail if the URL is not
a directory.  This works for me, but I'm not sure that it's
the best solution -- comments would be appreciated.
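
For anyone who wants to try the idea in isolation, here is a rough
sketch of the slash-appending rule using std::string instead of the
ht://Dig String class.  The function name is invented for this
example, and the check of the "valid_extensions" setting is left out;
only the path logic is shown.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the step added to URL::normalize():
// append a trailing slash to any path whose final component has no
// extension.  Paths that already end in a slash are left alone.
std::string append_slash_if_no_extension(std::string path)
{
    const std::string::size_type npos = std::string::npos;
    const std::string::size_type slash = path.rfind('/');
    const std::string::size_type dot = path.rfind('.');

    // True if the path already ends in a slash.
    const bool ends_in_slash = (slash != npos && slash + 1 == path.size());

    // A dot counts as an extension only if it comes after the last slash.
    const bool has_extension = (dot != npos && (slash == npos || dot > slash));

    if (!ends_in_slash && !has_extension)
        path += '/';

    // Postcondition: the result is never empty.
    assert(!path.empty());
    return path;
}
```

Under this rule "/bar" becomes "/bar/", while "/bar/" and
"/bar/index.html" are unchanged.  A path such as "/foo.d/bar" also
gets a slash, because its only dot is in an earlier component.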

-- 
Warren Jones
Fluke Corporation

---------------------------- snip snip ----------------------------

Index: Retriever.cc
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/htdig/Retriever.cc,v
retrieving revision 1.1.1.5
diff -c -r1.1.1.5 Retriever.cc
*** Retriever.cc        1999/12/15 22:06:09     1.1.1.5
--- Retriever.cc        2000/01/11 00:28:29
***************
*** 702,707 ****
--- 702,709 ----
      //
      char      *ext = strrchr(url, '.');
      String    lowerext;
+     if ( ext && strchr(ext,'/') )     // Ignore a dot if it's not in the
+       ext = NULL;                     // final component of the path.
      if (ext)
        {
        lowerext = ext;

Index: URL.cc
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/htlib/URL.cc,v
retrieving revision 1.1.1.5
diff -c -r1.1.1.5 URL.cc
*** URL.cc      1999/12/15 22:06:35     1.1.1.5
--- URL.cc      2000/01/11 23:09:26
***************
*** 469,474 ****
--- 469,490 ----
  
      removeIndex(_path);
  
+     if ( *config["valid_extensions"] != '\0' )
+     { 
+       // If we're only accepting valid extensions, then append
+       // a trailing slash to any URL without an extension.
+       // This ensures that the only URLs without extensions
+       // we retrieve will be directories.
+ 
+       char *slash = strrchr( _path, '/' );
+       if ( ! slash || slash[1] != '\0' )
+       {
+           char *dot = strrchr( _path, '.' );
+           if ( dot <= slash )
+               _path << "/";
+         }
+     }
+ 
      //
      // Convert a hostname to an IP address
      //
