[Nutch-dev] Re: Much faster RegExp lib needed in nutch?

Stefan Groschupf Thu, 16 Mar 2006 10:54:08 -0800

Besides that, we should probably add some kind of timeout to the URL filter in general, since a user can easily configure a regex for his Nutch setup that runs into the same problem we have just run into.

Something like the patch attached below.

Would you agree? I can create a serious patch and test it if we are interested in adding this as a fallback to the sources.

At least this would protect Nutch against bad user configurations. :-)

Index: src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java

===================================================================

--- src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java (revision 383682)
+++ src/plugin/urlfilter-regex/src/java/org/apache/nutch/net/RegexURLFilter.java (working copy)

@@ -75,16 +75,42 @@
   public synchronized String filter(String url) {
     Iterator i=rules.iterator();
+    MatcherThread mt;
     while(i.hasNext()) {
-      Rule r=(Rule) i.next();
-      Matcher matcher = r.pattern.matcher(url);
-
-      if (matcher.find()) {
-        //System.out.println("Matched " + r.regex);
-        return r.sign ? url : null;
-      }
+      mt = new MatcherThread();
+      mt.rule=(Rule) i.next();
+      mt.url=url;
+      mt.start();
+      try {
+        synchronized (mt.monitor) {
+          if (!mt.done) {
+            mt.monitor.wait(1000);   // at most one second per rule
+          }
+        }
+      } catch (InterruptedException e) {}
+      mt.stop();   // deprecated, but find() cannot be interrupted
+      if (mt.matched) {
+        return mt.rule.sign ? url : null;
+      }
     };

     return null;   // assume no go
   }
+
+  class MatcherThread extends Thread {
+    private Object monitor = new Object();
+    private String url;
+    private Rule rule;
+    private boolean matched = false;
+    private boolean done = false;
+    public void run() {
+      Matcher matcher = this.rule.pattern.matcher(url);
+      boolean result = matcher.find();
+      synchronized (monitor) {
+        this.matched = result;
+        this.done = true;
+        this.monitor.notify();
+      }
+    }
+  }
   //
   // Format of configuration file is
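
The patch relies on Thread.stop(), which is deprecated. As a possible alternative, here is a minimal sketch that needs no extra thread: the input is wrapped in a CharSequence that checks a deadline each time the regex engine reads a character. The names TimedRegex, DeadlineCharSequence and RegexTimeoutException are hypothetical and not part of Nutch, and this is a different technique from the one in the patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: regex find() with a time budget. */
final class TimedRegex {

  /** Thrown when the time budget runs out in the middle of a match. */
  static final class RegexTimeoutException extends RuntimeException {
    RegexTimeoutException(String message) {
      super(message);
    }
  }

  /** CharSequence view that checks a deadline on every character access. */
  private static final class DeadlineCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadlineNanos;

    DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
      this.inner = inner;
      this.deadlineNanos = deadlineNanos;
    }

    public char charAt(int index) {
      // The regex engine reads its input through charAt(), so a
      // backtracking match keeps coming back here and hits the check.
      if (System.nanoTime() - deadlineNanos > 0) {
        throw new RegexTimeoutException("regex match exceeded its time budget");
      }
      return inner.charAt(index);
    }

    public int length() {
      return inner.length();
    }

    public CharSequence subSequence(int start, int end) {
      return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    public String toString() {
      return inner.toString();
    }
  }

  /** Returns the result of find(); throws RegexTimeoutException on timeout. */
  static boolean find(Pattern pattern, String input, long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1000000L;
    Matcher matcher = pattern.matcher(new DeadlineCharSequence(input, deadline));
    return matcher.find();
  }
}
```

In a filter loop, the caller could catch RegexTimeoutException and simply reject the url. Checking the clock on every charAt() has a cost of its own, so this would need measuring before it is used for real.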


On 16.03.2006 at 18:10, Jérôme Charron wrote:

1. Keep the well-known perl syntax for regexp (and then find a way to
"simulate" them with automaton "limited" syntax) ?
My vote would be for option 1. It's less work for everyone
(except for the person incorporating the new library :)
That's my preferred solution too.
The first challenge is to see how to translate the regexps used in the default
regexp-urlfilter
templates provided by Nutch.
For now, the only thing I don't see how to translate from perl syntax to
dk.brics.automaton syntax is this regexp:
-.*(/.+?)/.*?\1/.*?\1/.*
In fact, automaton doesn't support capturing groups (Anders Moeller has
confirmed).
We cannot remove this regexp from urlfilter, but we cannot handle it with
automaton.
So, two solutions:
1. Keep java regexp ...
2. Switch to automaton and provide a java implementation of this regexp (it
is more a protection pattern than really a filter pattern, and it could
probably be hard-coded).
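
To illustrate solution 2, here is a rough sketch of what a hard-coded check could look like. The class and method names are hypothetical, and it is only an approximation of the regexp above, not an exact equivalent: it counts how often each single path segment occurs and ignores where the repeats sit in the url.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Nutch or dk.brics.automaton. */
final class RepeatedSegmentCheck {

  /**
   * Returns true if any non-empty slash-delimited segment of the url
   * occurs at least minRepeats times.
   */
  static boolean hasRepeatedSegment(String url, int minRepeats) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String segment : url.split("/")) {
      if (segment.isEmpty()) {
        continue;   // skip the empty pieces produced by "//"
      }
      Integer seen = counts.get(segment);
      int now = (seen == null) ? 1 : seen + 1;
      if (now >= minRepeats) {
        return true;   // this segment has shown up often enough
      }
      counts.put(segment, now);
    }
    return false;
  }
}
```

A url for which hasRepeatedSegment(url, 3) returns true would be rejected before the automaton-based rules are tried. The check runs in time linear in the length of the url, so it cannot blow up the way a backtracking match can.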

I'm waiting for your suggestions...

Regards

Jérôme

 --
http://motrech.free.fr/
http://www.frutch.org/


---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



