According to Peter L. Peres:
> + 1.1 The problem:
> +
> + When, on an open system (e.g. Linux) used on an intranet (no direct
> + connection to the Internet), documentation is added to the HTML
> + DocumentRoot tree by placing symbolic links to the documentation under
> + the DocumentRoot, and htdig is used to index this information, then htdig
> + (3.1.5) will enter an endless loop or try to index the entire system.
> +
> + It does this by reaping the URL of the 'parent directory' in
> + Apache-generated indexes of directories (such as the directories that
> + are soft-linked under the DocumentRoot). The 'parent directories' of a
> + directory entered by a symbolic link lead back all the way to root '/'. If
> + the patch is not applied, then htdig will try to index the entire system,
> + and may loop if any cross-linking exists.
...
> + then recompile and reinstall htdig (make; make install). Edit the config
> + file to turn on the new option, add a symbolic link under the DocumentRoot
> + (e.g. cd /usr/local/httpd/htdocs/misc; ln -s /usr/doc .; on SuSE systems),
> + and run htdig (rundig).
I'm still having a great deal of trouble envisioning how htdig can follow
symbolic links in such a way as to make the entire file system visible,
unless it finds a symbolic link to the root directory of the file system.
To use your example, the link /usr/local/httpd/htdocs/misc/doc -> /usr/doc
would make the whole /usr/doc sub-tree appear under the URL
http://localhost/misc/doc/, but when you follow the parent directory link,
it should take you to http://localhost/misc/, not to /usr! What would a
URL that leads to /usr or / look like, given that URLs are supposed to be
relative to the DocumentRoot?
I do understand, though, how cross links could lead to a file-system loop
when all symbolic links are followed, so I suspect that was the source of
your problem. Still, I can envision situations where mutual cross links
between two subtrees could lead to infinite loops even without following
up links. Essentially, the spider would be constantly descending deeper
into a hierarchy that doesn't end, because the backward links are concealed
as downward links. It's a theoretical possibility, anyway.
> + To see the patch working, run htsearch with -v. The patch causes a bang
^^^^^^^^
I assume you mean htdig here.
> diff -rcN tmp/htdig-3.1.5/htdig/HTML.cc htdig-3.1.5/htdig/HTML.cc
> *** tmp/htdig-3.1.5/htdig/HTML.cc Fri Feb 25 04:29:10 2000
> --- htdig-3.1.5/htdig/HTML.cc Mon May 4 01:11:01 1998
> ***************
> *** 394,400 ****
> head << word;
> }
>
> ! if (word.length() >= minimumWordLength && doindex)
> {
> retriever.got_word(word,
> int(offset * 1000 / totlength),
> --- 394,400 ----
> head << word;
> }
>
> ! if ((word.length() >= (unsigned)minimumWordLength) && doindex)
> {
> retriever.got_word(word,
> int(offset * 1000 / totlength),
What does this change do? Were you getting warnings before?
> + void
> + Retriever::chop_url(ChoppedUrlStore &cus,char *c_url)
> + {
> + int l;
> +
> + cus.url_store[0] = '\0';
> + cus.hop_count = 0;
> + l = strlen(c_url);
> + if((l == 0) || (l > MAX_CAN_URL_LEN)) {
You'll overrun the end of url_store if l == MAX_CAN_URL_LEN. Remember the
null terminator.
> + if(debug > 0)
> + cout << "chop_url: failed on len==0\n";
> + return;
> + }
> + strcpy(cus.url_store,c_url);
> + l = 0;
> + if((cus.url_store_chopped[l++] = strtok(cus.url_store,"/")) == NULL) {
> + cus.url_store[0] = '\0';
> + if(debug > 0)
> + cout << "chop_url: failed on NULL with " << c_url << "\n";
> + return;
> + }
> + while((cus.url_store_chopped[l++] = strtok(NULL,"/")) != NULL) {
> + if(l > MAX_CAN_URL_HOPS) {
> + cus.url_store[0] = '\0';
> + return; // fail silently with a valid url, print a bang somewhere else
> + }
> + }
> + cus.hop_count = l - 1;
> + return; // success
> + }
> +
> + // call this function to store the base URL of a document being indexed,
> + // when starting to index it (in HTML::parse or ExternalParser::parse)
> + void
> + Retriever::store_url(char *c_url)
> + {
> + chop_url(gus,c_url);
> + return;
> + }
> +
> + // call this function to decide if a reaped URL is a direct parent of
> + // the URL being indexed. call in Retriever::got_href()
> + int
> + Retriever::url_is_parent_dir(char *c_url)
> + {
> + int j,k;
> + ChoppedUrlStore cus;
> +
> + if(gus.hop_count == 0)
> + return 0;
> +
> + chop_url(cus,c_url);
> + if(cus.hop_count == 0)
> + return 0;
> +
> + // seek a matching last part, backwards
> + j = gus.hop_count - 1;
> + k = cus.hop_count - 1;
> + while(strcmp(gus.url_store_chopped[j],cus.url_store_chopped[k]) != 0)
> + if(--j < 0)
> + return 0; // not
What if a path component is repeated, e.g. /files/doc/html/doc/foo.html?
It seems this code could get confused by the repeated name, which could
cause a false match at the lower directory level.
> + while((--j >= 0)&&(--k >= 0))
> + if(strcmp(gus.url_store_chopped[j],cus.url_store_chopped[k]) != 0)
> + return 0; // not
> + return 1; // yes
> + }
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930