> >1. Keeps the well-known perl syntax for regexp (and then find a way to
> >"simulate" them with automaton "limited" syntax) ?
> My vote would be for option 1. It's less work for everyone
> (except for the person incorporating the new library :)


That's my preferred solution too.
The first challenge is to see how to translate the regexps used in the
default regexp-urlfilter templates provided by Nutch.
For now, the only thing I don't see how to translate from Perl syntax to
dk.brics.automaton syntax is this regexp:
-.*(/.+?)/.*?\1/.*?\1/.*
In fact, automaton doesn't support capturing groups (Anders Moeller has
confirmed), so the \1 back-references cannot be expressed.
We cannot remove this regexp from urlfilter, but we cannot handle it with
automaton either.
So, two solutions:
1. Keep Java regexp ...
2. Switch to automaton and provide a Java implementation of this regexp (it
is more a protection pattern than really a filter pattern, and it could
probably be hard-coded).
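
To give an idea of what solution 2 could look like, here is a minimal sketch
of a hard-coded check. The class and method names (RepeatedSegmentCheck,
hasRepeatedSegment) are hypothetical, they are not part of Nutch or
dk.brics.automaton. It assumes the URL contains no line terminators, and it
looks for a token of the form "/xxx/" that occurs three times without
overlapping, which is what the regexp above (without the leading "-", which
is only the exclude sign of the urlfilter file) is meant to catch:

```java
public class RepeatedSegmentCheck {

    // Hypothetical helper: returns true when some "/xxx/" token (xxx being
    // one or more characters, slashes included) occurs at least three times
    // in the URL without overlapping.
    static boolean hasRepeatedSegment(String url) {
        int n = url.length();
        for (int start = 0; start < n; start++) {
            if (url.charAt(start) != '/') {
                continue;
            }
            // The token must hold at least one character between its slashes.
            for (int end = start + 2; end < n; end++) {
                if (url.charAt(end) != '/') {
                    continue;
                }
                String token = url.substring(start, end + 1);
                // Second occurrence, starting after the first token.
                int second = url.indexOf(token, end + 1);
                if (second < 0) {
                    continue;
                }
                // Third occurrence, starting after the second token.
                int third = url.indexOf(token, second + token.length());
                if (third >= 0) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

This is cubic in the URL length in the worst case, which should be acceptable
for URLs, but it has not been measured against the Java regexp.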

I'm waiting for your suggestions...

Regards

Jérôme

 --
http://motrech.free.fr/
http://www.frutch.org/
