According to Martin Mačok:
> On Sat, Oct 26, 2002 at 10:06:12PM -0500, Geoff Hutchison wrote:
> > Instead, I'd suggest using the SourceForge bug tracker for ht://Dig
> > http://sourceforge.net/tracker/?atid=104593&group_id=4593&func=browse
>
> OK, I've tried to avoid it ;-) If this doesn't get resolved soon
> I will submit it.
>
> > >I found it with wwwoffle cache indexing scripts. htdig 3.1.x worked
> > >well but after upgrading to 3.2.0b4-072201 it broke. The cached
> > >pages are under "/search/index" directory and "/index" is
> > >disallowed. You can see that 3.2.0b rejects "/search/index" in
> > >debug output:
> >
> > Yes. I can't see anything in particular that would have solved this
> > in the meantime (which surprises me since I seem to remember this
> > before). For my own benefit, could you confirm that it fails for you
> > on the current snapshot?
>
> Hm, I've got sucking slow and expensive dialup here (Czech Republic,
> monopolistic phone operator ... you know) so I would like to avoid
> downloading extra 2MB ...

I don't think you need to do this. I'm pretty sure, by looking at the
code, that it's still a problem in current snapshots. I don't recall
any recent changes to this part of things. The problem is that while
3.1 used StringList::Compare, which does an "anchored" match (it doesn't
search for a substring like StringList::FindFirst does), the 3.2 code
uses HtRegex::match, which does an unanchored match. So, in 3.2 it
matches the substring anywhere in the URL, instead of just at the start
of the path component of the URL.

> Back to topic - I've got ht://Dig 3.2.0b4-072201 source code here and
> I tried to fix it after some short time of looking at the code. See
> the attachment and review it cause I'm not too familiar with htdig
> code internals, this is just a quick-try-hack, but it seems to be
> working here but not heavily tested though...
>
> By the way, I think that using regular expressions here is a way too
> big hammer for this simple task (i.e. just for testing if one string
> is equal to or just an extension of another). Robots.txt is not
> defined to contain regular expressions but htdig handles disallow
> lines as if they are regexps. Are you sure that won't cause any
> problems if somebody puts some "weird" characters in it?

I believe your patch will fix this problem correctly. But I think
you're right about "weird" characters causing problems. The pattern
that's built from disallow lines is given to HtRegex::set, not
HtRegex::setEscaped, so regular expression meta characters are taken
as operational. You're also right about this being too big a hammer
for the task. The old way did the job according to the standard, so
that's ultimately what we should go back to.

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
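
For readers following the thread, here is a minimal sketch of the difference between the anchored prefix test and the unanchored regex search discussed in the reply. It is not ht://Dig code: the helper names prefixDisallows and regexDisallows are hypothetical, and std::regex is only a stand-in for HtRegex, whose pattern syntax may differ.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from ht://Dig): anchored check in the spirit of
// the robots.txt convention. True only when `path` begins with `rule`.
// An empty Disallow value disallows nothing, so it never matches.
bool prefixDisallows(const std::string& path, const std::string& rule) {
    return !rule.empty() && path.compare(0, rule.size(), rule) == 0;
}

// Hypothetical helper (not from ht://Dig): treats `rule` as an unescaped
// regular expression and searches for it anywhere in `path`. std::regex is
// an assumption used for illustration; it throws std::regex_error if `rule`
// is not a valid pattern.
bool regexDisallows(const std::string& path, const std::string& rule) {
    return std::regex_search(path, std::regex(rule));
}
```

With these definitions, prefixDisallows("/search/index", "/index") is false, while regexDisallows("/search/index", "/index") is true, which mirrors the reported rejection of "/search/index". Likewise a rule such as "/a.b" matches "/axb" under the regex version because "." is taken as a metacharacter, whereas the prefix version treats it literally.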
