> >1. Keeps the well-known perl syntax for regexp (and then find a way to
> >"simulate" them with automaton "limited" syntax) ?
> My vote would be for option 1. It's less work for everyone
> (except for the person incorporating the new library :)

That's my preferred solution too.
The first challenge is to see how to translate the regexps used in the
default regexp-urlfilter templates provided by Nutch.
For now, the only thing I don't see how to translate from perl syntax
to dk.brics.automaton syntax is this regexp:

    -.*(/.+?)/.*?\1/.*?\1/.*

In fact, automaton doesn't support capturing groups, and therefore not
the \1 back-references this pattern relies on (Anders Moeller has
confirmed).
We cannot remove this regexp from the urlfilter, and we cannot handle it
with automaton. So, two solutions:
1. Keep Java regexp ...
2. Switch to automaton and provide a Java implementation of this regexp
(it is more a protection pattern than really a filter pattern, and it
could probably be hard-coded).

I'm waiting for your suggestions...

Regards

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/
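
P.S. To make solution 2 a bit more concrete, here is a minimal sketch of how the protection pattern might be hard-coded in plain Java, without any regexp engine. The class name RepeatedSegmentFilter and the method hasRepeatedSegment are hypothetical (they are not part of Nutch or of dk.brics.automaton). The sketch is meant to mirror a full-string match of .*(/.+?)/.*?\1/.*?\1/.* and assumes the URL contains no line terminators. It has not been tried inside Nutch.

```java
final class RepeatedSegmentFilter {

    private RepeatedSegmentFilter() {}

    /**
     * Returns true if some unit "/xxx/" (a slash, at least one other
     * character, then a slash) occurs three times in the url without
     * overlapping. This is what the back-references \1 check in
     * .*(/.+?)/.*?\1/.*?\1/.*
     */
    static boolean hasRepeatedSegment(String url) {
        int n = url.length();
        for (int i = 0; i < n; i++) {
            if (url.charAt(i) != '/') {
                continue; // a candidate unit must start with a slash
            }
            // j is the slash closing the candidate unit; starting at i + 2
            // guarantees at least one character between the two slashes.
            for (int j = url.indexOf('/', i + 2); j != -1; j = url.indexOf('/', j + 1)) {
                String unit = url.substring(i, j + 1);
                // Earliest second occurrence after the first one ends.
                int second = url.indexOf(unit, j + 1);
                if (second == -1) {
                    continue;
                }
                // Third occurrence must start after the second one ends.
                int third = url.indexOf(unit, second + unit.length());
                if (third != -1) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A url filter could then reject any URL for which RepeatedSegmentFilter.hasRepeatedSegment(url) returns true and hand everything else to automaton. The loop tries every candidate unit, so it is cubic in the URL length in the worst case. That is probably acceptable for URLs, but it is a design choice to revisit if it shows up in profiling.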