Besides that, we should perhaps add some kind of timeout to the URL filter
in general, since a user may configure a regex for their Nutch setup that
runs into the same problem we just hit.
Something like the patch attached below.
Would you agree? I can create a proper patch and test it if we are
interested in adding this to the sources as a fallback.
At least this would protect Nutch against bad user configurations. :-)
Index: src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java
===================================================================
--- src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java (revision 383682)
+++ src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java (working copy)
@@ -75,14 +75,20 @@
public synchronized String filter(String url) {
Iterator i=rules.iterator();
+ MatcherThread mt;
while(i.hasNext()) {
- Rule r=(Rule) i.next();
- Matcher matcher = r.pattern.matcher(url);
-
- if (matcher.find()) {
- //System.out.println("Matched " + r.regex);
- return r.sign ? url : null;
- }
+ mt = new MatcherThread();
+ mt.rule=(Rule) i.next();
+ mt.start();
+ try {
+ synchronized (mt.monitor) {
+ if (!mt.done) {
+ mt.monitor.wait(1000);
+ }
+ }
+ } catch (InterruptedException e) {}
+ mt.stop();
+ return mt.found ? url : null;
};
return null; // assume no go
@@ -87,6 +93,24 @@
return null; // assume no go
}
+
+ class MatcherThread extends Thread {
+ private Object monitor = new Object();
+ private String url;
+ private Rule rule;
+ private boolean found = false;
+ private boolean done = false;
+ public void run() {
+ Matcher matcher = this.rule.pattern.matcher(url);
+ if (matcher.find()) {
+ this.found = rule.sign;
+ }
+ synchronized (monitor) {
+ this.monitor.notify();
+ this.done = true;
+ }
+ }
+ }
//
// Format of configuration file is
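
The same timeout idea can be sketched with java.util.concurrent instead of the deprecated Thread.stop(). This is only a rough illustration, assuming Java 8 or later: TimedRegexMatcher and InterruptibleCharSequence are hypothetical names, not existing Nutch classes. Because a backtracking java.util.regex match does not react to interrupts on its own, the input is wrapped in a CharSequence whose charAt() checks the interrupt flag.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): runs Pattern.find() with a time limit. */
class TimedRegexMatcher {

  /** Wrapper that makes the matcher fail fast once its worker thread is interrupted. */
  static final class InterruptibleCharSequence implements CharSequence {
    private final CharSequence inner;

    InterruptibleCharSequence(CharSequence inner) {
      this.inner = inner;
    }

    public char charAt(int index) {
      if (Thread.currentThread().isInterrupted()) {
        throw new IllegalStateException("regex matching interrupted");
      }
      return inner.charAt(index);
    }

    public int length() {
      return inner.length();
    }

    public CharSequence subSequence(int start, int end) {
      return new InterruptibleCharSequence(inner.subSequence(start, end));
    }

    public String toString() {
      return inner.toString();
    }
  }

  // Daemon threads, so a stuck matcher can never keep the JVM alive.
  private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
    Thread t = new Thread(r, "regex-matcher");
    t.setDaemon(true);
    return t;
  });

  /** Returns TRUE or FALSE if matching finished in time, or null if it timed out. */
  Boolean find(Pattern pattern, String input, long timeoutMillis) {
    Callable<Boolean> task =
        () -> pattern.matcher(new InterruptibleCharSequence(input)).find();
    Future<Boolean> future = executor.submit(task);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // interrupts the worker; its next charAt() call throws
      return null;
    } catch (InterruptedException e) {
      future.cancel(true);
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
      return null;
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    }
  }

  void shutdown() {
    executor.shutdownNow();
  }
}
```

In filter(), each rule would be tried in turn with find(r.pattern, url, 1000). A TRUE result returns r.sign ? url : null, FALSE moves on to the next rule, and null (timeout) could be treated as a rejection of the URL.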
On 16.03.2006 at 18:10, Jérôme Charron wrote:
1. Keep the well-known Perl syntax for regexps (and then find a way to
"simulate" it with automaton's "limited" syntax)?
My vote would be for option 1. It's less work for everyone
(except for the person incorporating the new library :)
That's my preferred solution too.
The first challenge is to see how to translate the regexps used in the
default regexp-urlfilter templates provided by Nutch.
For now, the only thing I don't see how to translate from Perl syntax to
dk.brics.automaton syntax is this regexp:
-.*(/.+?)/.*?\1/.*?\1/.*
In fact, automaton doesn't support capturing groups (Anders Moeller has
confirmed).
We cannot remove this regexp from the urlfilter, but we cannot handle it
with automaton.
So, two solutions:
1. Keep java regexp ...
2. Switch to automaton and provide a Java implementation of this regexp
(it is more a protection pattern than really a filter pattern, and it
could probably be hard-coded).
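
To give an idea of what a hard-coded protection check might look like, here is a minimal sketch. It is a deliberate simplification and not equivalent to the back-reference regexp: it only counts how often a single '/'-delimited segment occurs, whereas the regexp's group can span several segments. RepeatedSegmentCheck and hasRepeatedSegment are hypothetical names, not part of Nutch or dk.brics.automaton.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: a simplified stand-in for the loop-protection regexp. */
class RepeatedSegmentCheck {

  /**
   * Returns true if any non-empty '/'-delimited segment of the URL
   * occurs at least maxRepeats times (scheme and host count as segments too).
   */
  static boolean hasRepeatedSegment(String url, int maxRepeats) {
    Map<String, Integer> counts = new HashMap<>();
    for (String segment : url.split("/")) {
      if (segment.isEmpty()) {
        continue; // skip the empty piece between the two slashes of "//"
      }
      int seen = counts.merge(segment, 1, Integer::sum);
      if (seen >= maxRepeats) {
        return true;
      }
    }
    return false;
  }
}
```

With maxRepeats set to 3, a URL such as http://example.com/a/b/a/c/a/ would be flagged, while http://example.com/a/b/c would pass.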
I'm waiting for your suggestions...
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com