I've been watching the discussion of faster regex libs with much interest. But if regex speed is the problem, would using fewer regexes be a good answer?

Protocol and extension filtering could be done by another URLFilter plugin that is dedicated to this task, and uses more lightweight string-chopping techniques. That way full regex support could be retained for the tasks where it's really needed.
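
To make the idea concrete, here is a rough sketch of what such a filter could look like. Everything in it is an assumption for illustration: the class name PrefixSuffixFilter, the prefix list and the extension list are made up, and it is a standalone class rather than an actual Nutch plugin. It returns the URL when accepted and null when rejected, which is how I understand the URLFilter contract to work.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical standalone helper, not an existing Nutch class.
public class PrefixSuffixFilter {

    // Assumed configuration; a real plugin would read these from a config file.
    private static final String[] ALLOWED_PREFIXES = {"http://", "https://"};
    private static final Set<String> SKIPPED_EXTENSIONS =
        new HashSet<String>(Arrays.asList("gif", "jpg", "png", "css", "zip", "exe"));

    /** Returns the URL unchanged if it passes, or null if it is rejected. */
    public static String filter(String url) {
        if (url == null) {
            return null;
        }
        String lower = url.toLowerCase(Locale.ROOT);

        // Protocol check: startsWith only looks at the beginning of the
        // string, so a URL embedded in the query string cannot match.
        boolean allowed = false;
        for (String prefix : ALLOWED_PREFIXES) {
            if (lower.startsWith(prefix)) {
                allowed = true;
                break;
            }
        }
        if (!allowed) {
            return null;
        }

        // Ignore the query string and fragment when looking for an extension.
        int end = lower.length();
        int query = lower.indexOf('?');
        if (query >= 0) {
            end = query;
        }
        int fragment = lower.indexOf('#');
        if (fragment >= 0 && fragment < end) {
            end = fragment;
        }

        // Find where the path starts (first '/' after the host name).
        int hostStart = lower.indexOf("://") + 3;
        int pathStart = lower.indexOf('/', hostStart);
        if (pathStart < 0 || pathStart >= end) {
            return url; // no path, so no file extension to check
        }

        // Extension check: text after the last '.' in the last path segment.
        String path = lower.substring(pathStart, end);
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        if (dot > slash && SKIPPED_EXTENSIONS.contains(path.substring(dot + 1))) {
            return null;
        }
        return url;
    }
}
```

A filter like this would run ahead of the regex filter, so that only URLs that survive the cheap checks ever reach the regex engine.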


On Mar 13, 2006, at 12:31 PM, Howie Wang wrote:


I have made some quick tests with regex-urlfilter...
The major problem is that it doesn't use the Perl syntax...
For instance, it doesn't support the boundary matchers ^ and $ (which are
used in nutch)

Are there other ways to match start/end of string in the other
regex library? I use "^http" a lot because a lot of sites pass around
URLs in the query string, and I don't want them (e.g.
http://del.icio.us/howie?url=http://lucene.apache.org/nutch)

Howie

--
Matt Kangas / [EMAIL PROTECTED]



