--On 27 April 2004 23:20 +0100 Karl Pielorz <[EMAIL PROTECTED]> wrote:
The only thing I can think of is maybe you have lines in url_list.txt that are too long. That file will be read by ParsedString::getFileContents(), in htlib/ParsedString.cc, and it uses a 1000 character buffer to read in lines from the file. Any lines that are longer will be chopped in two, and it could be that the remaining fragment is responsible for the false matches you're seeing. If that's the case, you can increase the buffer size to something bigger than the largest URL you need to deal with, or rewrite the code to deal with any size line. (If you do the latter, we'd appreciate the patch.)
I finally got to the bottom of this - after having a merry trip around htlib/StringMatch.cc & Co.
I traced it down to spaces being present in some of the URLs in the ${start_url} file specified in htdig.conf
These were being passed to StringMatch::Pattern(...) as 'fragments'. For example:
http://www.somewhere.com/some-page-you want-indexing .html
would be added to the limits list as:
" http://www.somewhere.com/some-page-you want-indexing .html "
This caused _any_ URL that happened to have a .html in it to match (which, let's face it, is going to be a lot).
I'm surprised ht://Dig doesn't have a separate 'definitive' list of just the URLs it's allowed to "touch" - I guess this would be duplicating stuff already used & handled with 'limits'.
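To illustrate the effect, here is a minimal sketch. It assumes the list entry is split on whitespace and that each resulting fragment is matched as a plain substring. Both functions are hypothetical stand-ins written for this example; they are not the real StringMatch code.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical: splits one list entry on whitespace, so a URL that
// contains spaces turns into several independent fragments.
std::vector<std::string> splitOnWhitespace(const std::string &entry)
{
    std::istringstream in(entry);
    std::vector<std::string> fragments;
    std::string fragment;
    while (in >> fragment)
        fragments.push_back(fragment);
    return fragments;
}

// Hypothetical: true if any fragment occurs anywhere in `url`
// (simple substring test, assumed here for illustration).
bool matchesAnyFragment(const std::string &url,
                        const std::vector<std::string> &fragments)
{
    for (const std::string &f : fragments)
        if (url.find(f) != std::string::npos)
            return true;
    return false;
}
```

With the example URL above, the entry splits into three fragments, the last one being ".html", so under these assumptions any URL containing ".html" passes the check. Rejecting or escaping entries that contain whitespace before they reach the limits list would avoid this.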
At least if it happens again - I know where to look...
-Karl
_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.) https://lists.sourceforge.net/lists/listinfo/htdig-general

