> > 1. Keep the well-known Perl syntax for regexps (and then find a way to
> > "simulate" it with automaton's "limited" syntax)?
> My vote would be for option 1. It's less work for everyone
> (except for the person incorporating the new library :)
That's my preferred solution too.
The first challenge is to see how to translate the regexps used in the
default regexp-urlfilter templates provided by Nutch.
For now, the only thing I don't see how to translate from Perl syntax to
dk.brics.automaton syntax is this regexp:
-.*(/.+?)/.*?\1/.*?\1/.*
The problem is that automaton doesn't support capturing groups, and so it
cannot express back-references such as \1 (Anders Moeller has confirmed this).
We cannot remove this regexp from the urlfilter, but we cannot handle it with
automaton either.
So, two solutions:
1. Keep Java regexps ...
2. Switch to automaton and provide a Java implementation of this regexp (it
is more a protection pattern than a real filter pattern, and it could
probably be hard-coded).
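
To make solution 2 a bit more concrete, here is a rough sketch of what a
hard-coded check could look like. The class and method names are made up for
illustration and are not part of Nutch or dk.brics.automaton. It assumes the
URL contains no line terminators, and it keeps the java.util.regex version
alongside only as a reference to compare against. I have not profiled it.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or dk.brics.automaton.
final class RepeatedSegmentCheck {

    // The rule from the urlfilter file, without the leading '-' that marks
    // it as an exclusion. Kept only as a reference implementation.
    private static final Pattern ORIGINAL =
            Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/.*");

    static boolean matchesOriginal(String url) {
        return ORIGINAL.matcher(url).matches();
    }

    // Hand-coded version: returns true if some unit "/xxx/" (xxx non-empty,
    // possibly containing slashes) occurs three times without overlapping.
    static boolean hasRepeatedSegment(String url) {
        for (int start = url.indexOf('/'); start >= 0;
                 start = url.indexOf('/', start + 1)) {
            // 'end' is the slash closing the candidate unit; starting the
            // search at start + 2 keeps at least one character in between.
            for (int end = url.indexOf('/', start + 2); end >= 0;
                     end = url.indexOf('/', end + 1)) {
                String unit = url.substring(start, end + 1);
                // Earliest second occurrence after the first one.
                int second = url.indexOf(unit, end + 1);
                if (second < 0) {
                    continue;
                }
                // Third occurrence must start after the second one ends.
                if (url.indexOf(unit, second + unit.length()) >= 0) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With something like this, the remaining patterns could go through automaton
and only this one rule would be checked in plain Java, for example by calling
RepeatedSegmentCheck.hasRepeatedSegment(url) before the automaton filter runs.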
I'm waiting for your suggestions...
I've pinged Terence Parr, the author of ANTLR. I heard that the new version
(ANTLR 3) has a fast FSM inside it. If so, somebody could write an
ANTLR grammar to convert the Nutch regex into another ANTLR grammar
that, when processed by ANTLR, creates a URL parser/validator.
It's almost too easy... :)
Anyway, waiting to hear back from Ter.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers