I was very happy to find that the "valid_extensions" option was
added in version 3.1.4 -- something like this is essential
given the rather chaotic nature of the web server that I have
to index. But I found that a couple of changes were necessary
to make valid_extensions work the way I wanted it to.

If "valid_extensions" is defined, I'd like to retrieve URLs
without extensions *if_and_only_if* they represent a directory.
However, I found that all URLs without extensions are rejected
if the URL contains a fully qualified domain name, e.g.:

    http://www.foo.com/bar/

Retriever::IsValidURL() rejects this URL because it takes
everything from the last dot onward as the extension, and here
the last dot is in the host name, so it thinks the extension is:

    .com/bar/

The patch for Retriever.cc (included below) fixes this.
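
For anyone who wants to try the check outside of htdig, here is a
minimal standalone sketch of the same test. The function name
final_extension() is made up for illustration and does not exist in
the htdig source; it only mirrors the two lines the patch adds.

```cpp
#include <cstring>

// Hypothetical helper (not part of htdig): return a pointer to the
// extension, including its leading dot, or nullptr if the last dot
// in the URL is not in the final component of the path.
const char *final_extension(const char *url)
{
    const char *ext = std::strrchr(url, '.');

    // If a slash follows the last dot, the dot belongs to the host
    // name or to an earlier path component, so it is not an extension.
    if (ext && std::strchr(ext, '/'))
        ext = nullptr;

    return ext;
}
```

With this check, "http://www.foo.com/bar/" yields no extension, while
"http://www.foo.com/bar/index.html" still yields ".html". Note that
the sketch only looks at the raw string, so a URL with no path at all,
such as "http://www.foo.com", would still yield ".com".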

To ensure that a URL without an extension will be retrieved
only if it's a directory, I modified URL::normalize() so that
a slash is appended to any URL that doesn't have an extension.
This guarantees that retrieval will fail if the URL is not
a directory. This works for me, but I'm not sure that it's
the best solution -- comments would be appreciated.
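
The slash-appending rule can also be sketched on its own. This is
only an approximation under my own assumptions: it uses std::string
instead of the htdig String class, it ignores the valid_extensions
config lookup, and the function name append_slash_if_no_extension()
is made up for illustration.

```cpp
#include <string>

// Hypothetical stand-in for the path handling added to
// URL::normalize(): append a trailing slash to any path whose final
// component has no extension.
std::string append_slash_if_no_extension(std::string path)
{
    const std::string::size_type npos = std::string::npos;
    const std::string::size_type slash = path.rfind('/');

    // Path already ends in a slash: leave it alone.
    if (slash != npos && slash + 1 == path.size())
        return path;

    // An extension exists only if the last dot comes after the
    // last slash (or there is a dot and no slash at all).
    const std::string::size_type dot = path.rfind('.');
    const bool has_extension =
        dot != npos && (slash == npos || dot > slash);

    if (!has_extension)
        path += '/';

    return path;
}
```

So "/bar" becomes "/bar/", while "/bar/" and "/bar/index.html" are
left unchanged. A dot in an earlier component does not count, so
"/v1.2/bar" becomes "/v1.2/bar/".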
--
Warren Jones
Fluke Corporation
---------------------------- snip snip ----------------------------
Index: Retriever.cc
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/htdig/Retriever.cc,v
retrieving revision 1.1.1.5
diff -c -r1.1.1.5 Retriever.cc
*** Retriever.cc 1999/12/15 22:06:09 1.1.1.5
--- Retriever.cc 2000/01/11 00:28:29
***************
*** 702,707 ****
--- 702,709 ----
//
char *ext = strrchr(url, '.');
String lowerext;
+ if ( ext && strchr(ext,'/') ) // Ignore a dot if it's not in the
+ ext = NULL; // final component of the path.
if (ext)
{
lowerext = ext;
Index: URL.cc
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/htlib/URL.cc,v
retrieving revision 1.1.1.5
diff -c -r1.1.1.5 URL.cc
*** URL.cc 1999/12/15 22:06:35 1.1.1.5
--- URL.cc 2000/01/11 23:09:26
***************
*** 469,474 ****
--- 469,490 ----
removeIndex(_path);
+ if ( *config["valid_extensions"] != '\0' )
+ {
+ // If we're only accepting valid extensions, then append
+ // a trailing slash to any URL without an extension.
+ // This ensures that the only URLs without extensions
+ // we retrieve will be directories.
+
+ char *slash = strrchr( _path, '/' );
+ if ( ! slash || slash[1] != '\0' )
+ {
+ char *dot = strrchr( _path, '.' );
+ if ( dot <= slash )
+ _path << "/";
+ }
+ }
+
//
// Convert a hostname to an IP address
//