According to Pub Litics:
> OK, been working on this for the past nine hours or so, but seem to
> be stumped.
>
> I have gotten into htdig.conf file, primarily (since that is the file
> to which my attention was directed). I have a main home directory
> as the starting point. Under that my directory structure consists of
> text files, image files, htdig files, forum files [the forum files are
> the ones I do not want indexed], applet files and cgi-bin. Under each
> sub-directory there are many, many sub-directories. The site, itself,
> is about 300 MG.
>
> There is a mysql database for the forum, but it is tucked away in
> the server and not referenced in the directory structure, except for
> the file which calls it up. Under excluded URLs, I had listed /forum.
> This listing did not stop htdig from searching and indexing thousands
> of unwanted listings, however, from the forum. A typical listing looks
> like: /forum/viewforum.php?f=3&sid=f4d181d874cbc2cc0f41f2927959f2c5
>
> I tried /forum/ [with an added forward-slash], but that did not help.
> Would it be possible to start at the sub-directory level, perhaps,
> with multiple starting points?
>
> At present, the search engine is totally useless because it searches
> and indexes repeatedly. The suggestion asks "where to prune?" I would
> reply "anywhere to exclude /forum and all under it."
>
> I tried to understand what you mean by the bad query string process,
> but I cannot figure out what you mean. I have read all the material
> and inspected htdig.conf copiously, but (I apologize) I do not know
> what I am supposed to do. Help! Thanks.

I must apologize for misleading you yesterday. I was under the mistaken
impression then that you wanted to index certain parts of the /forum
subdirectory, but wanted to rein in htdig because it was getting lost
in all the cross-links that the forum scripts generate. In rereading
your earlier messages, as well as this one, I see that you quite clearly
stated you don't want any of /forum indexed. So, you're not looking to
prune this tree down to size at all, you're looking to lop it off at
the root.

The question, then, is not what the proper settings of exclude_urls and
bad_querystr ought to be. You're already trying the proper setting of
exclude_urls, as stated above and previously, and bad_querystr doesn't
apply if you don't want any of these scripts indexed at all. The
question is why htdig isn't taking your exclude_urls setting.

If you've exhausted all the possibilities of FAQ 5.31, then I'd suggest
another possibility that's bitten other users lately. When you Control-C
out of htdig, it now by default creates a db.log file of all the URLs
that have been "pushed" but not indexed so far. This file tends to be
persistent: if it's there, htdig reads it, and if you interrupt htdig
again, it recreates it. So, if you interrupted htdig before changing
exclude_urls, and restarted htdig afterward, the db.log file (in your
database directory) may have held a number of /forum URLs that had
already been pushed prior to the change to exclude_urls, and that don't
get rechecked afterward. If this is the case, simply delete the file
and restart htdig to truly restart from scratch.

I think we need to get htdig to issue a warning if it loads a db.log
file that's older than the config file it's using. In fact, it would
make sense for htdig to record the modtime and a checksum of the config
file in db.log to make sure you're restarting with the same config.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
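
For anyone wanting to check for the stale db.log condition described above before deleting anything, here is a rough Python sketch. It is not part of htdig, and it rests on two assumptions: that exclude_urls patterns are matched as plain substrings of the URL (which is how I read the documentation), and that db.log is a text file in which each pushed URL appears somewhere on its own line. The function names and any paths you pass in are hypothetical.

```python
import os


def is_excluded(url, patterns):
    """True if the URL contains any pattern as a plain substring.

    Assumption: this mirrors how exclude_urls is applied; it is not
    htdig's own code.
    """
    return any(p in url for p in patterns)


def log_is_stale(log_path, conf_path):
    """True if the log file exists and is older than the config file."""
    if not os.path.exists(log_path):
        return False
    return os.path.getmtime(log_path) < os.path.getmtime(conf_path)


def leftover_urls(log_path, patterns):
    """Return the log lines that contain an excluded pattern.

    Assumption: the log is plain text, one pushed URL per line
    (possibly with extra fields on the same line).
    """
    with open(log_path, encoding="utf-8", errors="replace") as f:
        return [line.strip() for line in f if is_excluded(line, patterns)]
```

Calling log_is_stale() with the paths to your db.log and htdig.conf, and then leftover_urls() with the list ["/forum"], should tell you whether the log predates your config change and still carries /forum URLs. If both are true, deleting the file before restarting htdig is the fix suggested above.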

