[Nutch-dev] Re: Much faster RegExp lib needed in nutch?

Matt Kangas Mon, 13 Mar 2006 10:59:35 -0800

I've been watching the discussion of faster regex libs with much interest. But if regex speed is the problem, would using fewer regexes be a good answer?

Protocol and extension filtering could be done by another URLFilter plugin that is dedicated to this task and uses more lightweight string-chopping techniques. That way, full regex support could be retained for the tasks where it's really needed.
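
To make the idea concrete, here is a rough sketch of what such a filter might look like. It is not an existing Nutch plugin: the class name SimpleUrlFilter, the allowed schemes, and the blocked extensions are all made up for illustration. It only borrows the convention of returning the URL when accepted and null when rejected, and it uses nothing but indexOf/substring.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-alone filter; the names and lists below are assumptions.
class SimpleUrlFilter {
    // Illustrative lists only; a real plugin would read these from config.
    private static final Set<String> ALLOWED_SCHEMES =
        new HashSet<>(Arrays.asList("http", "https"));
    private static final Set<String> BLOCKED_EXTENSIONS =
        new HashSet<>(Arrays.asList("gif", "jpg", "png", "css", "zip", "exe"));

    /** Returns the URL if it passes, or null if it is rejected. */
    static String filter(String url) {
        if (url == null) {
            return null;
        }
        // Protocol check: only the text before the first "://" counts,
        // so a URL embedded in the query string is never consulted.
        int schemeEnd = url.indexOf("://");
        if (schemeEnd <= 0) {
            return null;
        }
        String scheme = url.substring(0, schemeEnd).toLowerCase(Locale.ROOT);
        if (!ALLOWED_SCHEMES.contains(scheme)) {
            return null;
        }

        // The path ends at the first '?' or '#', whichever comes first.
        int hostStart = schemeEnd + 3;
        int pathEnd = url.length();
        int query = url.indexOf('?', hostStart);
        if (query >= 0) {
            pathEnd = query;
        }
        int fragment = url.indexOf('#', hostStart);
        if (fragment >= 0 && fragment < pathEnd) {
            pathEnd = fragment;
        }

        int pathStart = url.indexOf('/', hostStart);
        if (pathStart < 0 || pathStart >= pathEnd) {
            return url; // no path component, so no extension to check
        }

        // Extension check: the text after the last '.' in the last path segment.
        String path = url.substring(pathStart, pathEnd);
        int lastSlash = path.lastIndexOf('/');
        int lastDot = path.lastIndexOf('.');
        if (lastDot > lastSlash) {
            String ext = path.substring(lastDot + 1).toLowerCase(Locale.ROOT);
            if (BLOCKED_EXTENSIONS.contains(ext)) {
                return null;
            }
        }
        return url;
    }
}
```

With these made-up lists, SimpleUrlFilter.filter("ftp://example.com/file") and SimpleUrlFilter.filter("http://example.com/logo.GIF") would be rejected, while a URL that merely carries another URL or a ".gif" in its query string would pass, since only the leading scheme and the path are examined.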



On Mar 13, 2006, at 12:31 PM, Howie Wang wrote:

I have made some quick tests with regex-urlfilter...
The major problem is that it doesn't use the Perl syntax...
For instance, it doesn't support the boundary matchers ^ and $ (which are
used in nutch)


Are there other ways to match the start/end of a string in the other
regex library? I use "^http" a lot because a lot of sites pass around
URLs in the query string, and I don't want them (e.g.
http://del.icio.us/howie?url=http://lucene.apache.org/nutch)

Howie


--
Matt Kangas / [EMAIL PROTECTED]




