According to Martin Mačok:
> On Sat, Oct 26, 2002 at 10:06:12PM -0500, Geoff Hutchison wrote:
> > Instead, I'd suggest using the SourceForge bug tracker for ht://Dig
> > http://sourceforge.net/tracker/?atid=104593&group_id=4593&func=browse
> 
> OK, I've tried to avoid it ;-) If this doesn't get resolved soon
> I will submit it.
> 
> > >I found it with wwwoffle cache indexing scripts. htdig 3.1.x worked
> > >well but after upgrading to 3.2.0b4-072201 it broke. The cached
> > >pages are under "/search/index" directory and "/index" is
> > >disallowed. You can see that 3.2.0b rejects "/search/index" in
> > >debug output:
> >
> > Yes. I can't see anything in particular that would have solved this
> > in the meantime (which surprises me since I seem to remember this
> > before). For my own benefit, could you confirm that it fails for you
> > on the current snapshot?
> 
> Hm, I've got sucking slow and expensive dialup here (Czech Republic,
> monopolistic phone operator ... you know) so I would like to avoid
> downloading extra 2MB ...

I don't think you need to do this.  I'm pretty sure, from looking at the
code, that it's still a problem in current snapshots.  I don't recall
any recent changes to this part of the code.

The problem is that 3.1 used StringList::Compare, which does an
"anchored" match (it doesn't search for a substring like
StringList::FindFirst does), while the 3.2 code uses HtRegex::match,
which does an unanchored match.  So 3.2 matches the disallowed string
anywhere in the URL, instead of only at the start of the path component
of the URL.
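
To illustrate the difference, here is a minimal sketch in standard C++.
It does not use the ht://Dig classes; anchored_match and unanchored_match
are hypothetical helpers that only mimic the two behaviours described
above, using the "/index" rule and "/search/index" path from the report.

```cpp
#include <string>

// Hypothetical helper: anchored comparison. True only if `path` begins
// with `rule`. An empty rule matches nothing here, on the assumption that
// an empty Disallow line disallows nothing.
bool anchored_match(const std::string& path, const std::string& rule) {
    return !rule.empty() && path.compare(0, rule.size(), rule) == 0;
}

// Hypothetical helper: unanchored search. True if `rule` occurs anywhere
// in `path`, which is the behaviour that causes the false rejection.
bool unanchored_match(const std::string& path, const std::string& rule) {
    return !rule.empty() && path.find(rule) != std::string::npos;
}
```

With these, anchored_match("/search/index", "/index") is false, so the
page would be indexed, while unanchored_match("/search/index", "/index")
is true, so the page would be rejected.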

> Back to topic - I've got ht://Dig 3.2.0b4-072201 source code here and
> I tried to fix it after some short time of looking at the code. See
> the attachment and review it cause I'm not too familiar with htdig
> code internals, this is just a quick-try-hack, but it seems to be
> working here but not heavily tested though...
> 
> By the way, I think that using regular expressions here is a way too
> big hammer for this simple task (i.e. just for testing if one string
> is equal to or just an extension of another). Robots.txt is not
> defined to contain regular expressions but htdig handles disallow
> lines as if they are regexps. Are you sure that won't cause any
> problems if somebody puts some "weird" characters in it?

I believe your patch will fix this problem correctly.  But I think you're
right about "weird" characters causing problems.  The pattern that's built
from disallow lines is given to HtRegex::set, not HtRegex::setEscaped, so
regular expression metacharacters keep their special meaning instead of
being matched literally.  You're also right about this being too big a
hammer for the task.  The old way did the job according to the standard,
so that's ultimately what we should go back to.
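
A rough sketch of the metacharacter problem, using std::regex from
standard C++ as a stand-in rather than HtRegex itself: escape_regex and
regex_prefix_match are hypothetical helpers, and the "/a.html" rule is a
made-up example.  Other regex flavours may differ in detail.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: backslash-escape regex metacharacters so that a
// disallow rule is matched literally.
std::string escape_regex(const std::string& text) {
    static const std::string meta = "\\^$.|?*+()[]{}";
    std::string out;
    for (char c : text) {
        if (meta.find(c) != std::string::npos) out += '\\';
        out += c;
    }
    return out;
}

// Hypothetical helper: anchored regex match at the start of `path`.
// `pattern` is used as given, so unescaped metacharacters are active.
bool regex_prefix_match(const std::string& path, const std::string& pattern) {
    return std::regex_search(path, std::regex("^" + pattern));
}
```

Here regex_prefix_match("/a_html", "/a.html") is true, because the
unescaped "." matches any character, while the same call with
escape_regex("/a.html") as the pattern is false.  A plain anchored
string comparison avoids the issue entirely, with no escaping needed.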

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

