According to Pub Litics:
> OK, been working on this for the past nine hours or so, but seem to
> be stumped.
> 
> I have gotten into htdig.conf file, primarily (since that is the file
> to which my attention was directed).  I have a main home directory
> as the starting point.  Under that my directory structure consists of
> text files, image files, htdig files, forum files [the forum files are
> the ones I do not want indexed], applet files and cgi-bin.  Under each
> sub-directory there are many, many sub-directories.  The site, itself,
> is about 300 MB.
> 
> There is a mysql database for the forum, but it is tucked away in
> the server and not referenced in the directory structure, except for
> the file which calls it up. Under excluded URLs, I had listed /forum.
> This listing did not stop htdig from searching and indexing thousands
> of unwanted listings, however, from the forum.  A typical listing looks
> like:  /forum/viewforum.php?f=3&sid=f4d181d874cbc2cc0f41f2927959f2c5
> 
> I tried /forum/ [with an added forward-slash], but that did not help.
> Would it be possible to start at the sub-directory level, perhaps,
> with multiple starting points?
> 
> At present, the search engine is totally useless because it searches
> and indexes repeatedly.  The suggestion asks "where to prune?"  I would
> reply "anywhere to exclude /forum and all under it."
> 
> I tried to understand what you mean by the bad query string process,
> but I cannot figure out what you mean.  I have read all the material
> and inspected htdig.conf copiously, but (I apologize) I do not know
> what I am supposed to do.  Help!  Thanks.

I must apologize for misleading you yesterday.  I was under the mistaken
impression then that you wanted to index certain parts of the /forum
subdirectory, but wanted to rein in htdig because it was getting lost in
all the cross-links that the forum scripts generate.  In rereading your
earlier messages, as well as this one, I see that you quite clearly stated
you don't want any of /forum indexed.  So, you're not looking to prune
this tree down to size at all; you're looking to lop it off at the root.

The question, then, is not what the proper settings of exclude_urls and
bad_querystr ought to be.  You're already trying the proper setting
of exclude_urls, as stated above and previously, and bad_querystr
doesn't apply if you don't want any of these scripts indexed at all.
The question is why htdig isn't taking your exclude_urls setting.
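To see why /forum looks like the right value, here is a minimal sketch of
substring-style exclusion, assuming exclude_urls rejects any URL that
contains one of its space-separated patterns.  The function name
is_excluded is hypothetical and this is not htdig's actual code; the
pattern list is only an example value.

```python
def is_excluded(url, exclude_urls):
    """Return True if the URL contains any space-separated pattern.

    Illustration only: assumes plain substring matching.
    """
    return any(pattern in url for pattern in exclude_urls.split())


# Example value for the attribute; /forum is the pattern under discussion.
patterns = "/cgi-bin/ .cgi /forum"

forum_url = "/forum/viewforum.php?f=3&sid=f4d181d874cbc2cc0f41f2927959f2c5"
print(is_excluded(forum_url, patterns))           # forum URL is rejected
print(is_excluded("/text/index.html", patterns))  # ordinary page is kept
```

Under that assumption, both /forum and /forum/ match the typical listing
above, so the pattern itself is unlikely to be the problem.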

If you've exhausted all the possibilities of FAQ 5.31, then I'd suggest
another possibility that's bitten other users lately.  When you Control-C
out of htdig, it now by default creates a db.log file of all the URLs
that have been "pushed" but not indexed so far.  This file tends to be
persistent: if it's there, htdig reads it, and if you interrupt htdig
again, it recreates it.  So, if you interrupted htdig before changing
exclude_urls and restarted htdig afterward, the db.log file (in your
database directory) may have contained a number of /forum URLs that
had already been pushed before the change to exclude_urls and that
don't get rechecked afterward.  If this is the case, simply delete the
file and restart htdig to truly restart from scratch.
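A quick way to check for this is sketched below.  It assumes db.log is
a plain text file in which each pending URL appears somewhere on its own
line (the exact format is an assumption).  The helper names are
hypothetical, and the path you pass in must be your own database
directory.

```python
import os


def pending_forum_urls(log_path, unwanted="/forum"):
    """Return the lines of the log file that mention the unwanted path."""
    if not os.path.exists(log_path):
        return []
    with open(log_path, errors="replace") as log:
        return [line.rstrip("\n") for line in log if unwanted in line]


def remove_log(log_path):
    """Delete the log file if present; return True if a file was removed."""
    if os.path.exists(log_path):
        os.remove(log_path)
        return True
    return False
```

Calling pending_forum_urls on the db.log in your database directory
shows whether leftover /forum URLs are waiting there; remove_log then
clears the file before you restart htdig.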

I think we need to get htdig to issue a warning if it loads a db.log
file that's older than the config file it's using.  In fact, it would
make sense for htdig to record the modtime and a checksum of the config
file in db.log to make sure you're restarting with the same config.
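The check I have in mind could look roughly like the sketch below.  This
is only an illustration in Python of the idea, not code from htdig; the
function names are hypothetical, and SHA-256 is just one possible choice
of checksum.

```python
import hashlib
import os


def config_fingerprint(config_path):
    """Return (modification time, checksum) for the config file."""
    with open(config_path, "rb") as conf:
        digest = hashlib.sha256(conf.read()).hexdigest()
    return os.path.getmtime(config_path), digest


def log_older_than_config(log_path, config_path):
    """True if the log was written before the config was last modified."""
    return os.path.getmtime(log_path) < os.path.getmtime(config_path)


def config_changed(recorded, config_path):
    """Compare a recorded fingerprint against the current config.

    Only the checksum is compared, since the modification time can
    change without the contents changing.
    """
    return recorded[1] != config_fingerprint(config_path)[1]
```

The modtime comparison covers the simple warning case, while a recorded
fingerprint would let a restart detect that the config contents differ
from those in use when the log was written.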

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

