Hello,

I am attaching the patch for RegexUrlNormalizer and RegexURLFilter; it should reduce thread synchronization during fetching. It took me quite a long time because I was busy at work, but I finally did it. I tested it by downloading about 1 million URLs with 200 fetcher threads, and it ran without problems. I used ThreadLocal to keep it consistent with the changes to BasicUrlNormalizer committed by Doug.
Regards
Piotr Kosiorowski




Doug Cutting wrote:

Piotr Kosiorowski wrote:

I had a look at the oro code and it looks like Perl5Matcher construction is not expensive at all. GC time might increase due to frequent object construction, but I do not think it should create a problem. I didn't write any performance tests to check whether using ThreadLocal is faster than constructing a new matcher in this case, but for me the important thing is that it now works correctly.


Perhaps this was a premature optimization. I did not actually perform any benchmarking. I just assumed that it would be significantly faster to avoid allocating objects each time a URL is tested.

I have an additional question: RegexUrlNormalizer has its normalize() method synchronized. The only thing inside this method that needs synchronization is exactly the same oro usage problem (the shared Perl5Matcher). Because normalize() is synchronized we do not get exceptions, but fetcher threads might be slowed down by the synchronization.
Exactly the same solution is used in RegexURLFilter.
If someone who knows this code can cross-check my findings and thinks the change would be useful, I can prepare a patch for it (using ThreadLocal, to have the same solution in all places).
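
To illustrate the two options discussed above (a matcher per thread versus a new matcher per call), here is a minimal sketch. It is not the Nutch code: it assumes java.util.regex in place of oro so that it runs without extra libraries, and the class name, method names and the jsessionid rule are all hypothetical. The idea is the same in both libraries: the compiled pattern is shared, while the stateful matcher is either confined to one thread or created fresh.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; java.util.regex stands in for oro's Perl5Pattern/Perl5Matcher.
final class ThreadLocalMatcherSketch {
    // A compiled Pattern is immutable, so all threads may share it.
    private static final Pattern SESSION_ID = Pattern.compile(";jsessionid=[^?#]*");

    // A Matcher holds mutable state, so each thread lazily gets its own instance.
    private static final ThreadLocal<Matcher> MATCHER =
        ThreadLocal.withInitial(() -> SESSION_ID.matcher(""));

    // Option 1: reuse the calling thread's matcher. No 'synchronized' is needed
    // because the only mutable object is never visible to another thread.
    static String normalize(String url) {
        Matcher m = MATCHER.get();
        m.reset(url);
        return m.replaceAll("");
    }

    // Option 2: allocate a new matcher on every call. Also thread-safe,
    // at the cost of one short-lived object per URL.
    static String normalizeWithFreshMatcher(String url) {
        return SESSION_ID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println(
            Thread.currentThread().getName() + " -> "
            + normalize("http://example.com/a;jsessionid=ABC123?x=1"));
        Thread t1 = new Thread(task, "fetcher-1");
        Thread t2 = new Thread(task, "fetcher-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```

Which option is faster would have to be measured; the sketch only shows that neither needs a lock around the matching code.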


Your analysis sounds good to me. Please send along a patch, ideally as an attachment.

Cheers,

Doug


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers





Index: java/net/nutch/net/RegexURLFilter.java
===================================================================
RCS file: /cvsroot/nutch/nutch/src/java/net/nutch/net/RegexURLFilter.java,v
retrieving revision 1.3
diff -u -r1.3 RegexURLFilter.java
--- java/net/nutch/net/RegexURLFilter.java      5 May 2004 18:23:08 -0000       1.3
+++ java/net/nutch/net/RegexURLFilter.java      30 Jan 2005 05:05:09 -0000
@@ -45,7 +45,11 @@
   }
 
   private List rules;
-  private PatternMatcher matcher = new Perl5Matcher();
+  private ThreadLocal matchers = new ThreadLocal() {
+      protected synchronized Object initialValue() {
+          return new Perl5Matcher();
+        }
+      };
 
   public RegexURLFilter() throws IOException, MalformedPatternException {
     String file = NutchConf.get("urlfilter.regex.file");
@@ -63,7 +67,8 @@
     rules = readConfigurationFile(new FileReader(filename));
   }
 
-  public synchronized String filter(String url) {
+  public String filter(String url) {
+    Perl5Matcher matcher = (Perl5Matcher)matchers.get();
     Iterator i=rules.iterator();
     while(i.hasNext()) {
       Rule r=(Rule) i.next();
@@ -113,7 +118,8 @@
       String regex=line.substring(1);
 
       Rule rule=new Rule();
-      rule.pattern=(Perl5Pattern) compiler.compile(regex);
+      rule.pattern=(Perl5Pattern) 
+          compiler.compile(regex,Perl5Compiler.READ_ONLY_MASK);
       rule.sign=sign;
       rule.regex=regex;
       rules.add(rule);
Index: java/net/nutch/net/RegexUrlNormalizer.java
===================================================================
RCS file: /cvsroot/nutch/nutch/src/java/net/nutch/net/RegexUrlNormalizer.java,v
retrieving revision 1.1
diff -u -r1.1 RegexUrlNormalizer.java
--- java/net/nutch/net/RegexUrlNormalizer.java  7 Sep 2004 19:26:07 -0000       1.1
+++ java/net/nutch/net/RegexUrlNormalizer.java  30 Jan 2005 05:05:09 -0000
@@ -42,7 +42,11 @@
     }
     
     private List rules;
-    private PatternMatcher matcher = new Perl5Matcher();
+    private ThreadLocal matchers = new ThreadLocal() {
+        protected synchronized Object initialValue() {
+            return new Perl5Matcher();
+          }
+        };
     
     /** Default constructor which gets the file name from either <tt>nutch-site.xml</tt>
       * or <tt>nutch-default.xml</tt> and reads that configuration file.  It stores the regex patterns
@@ -64,7 +68,8 @@
     
     /** This function does the replacements by iterating through all the regex patterns.
       * It accepts a string url as input and returns the altered string. */
-    public synchronized String regexNormalize(String urlString) {
+    public String regexNormalize(String urlString) {
+      Perl5Matcher matcher = (Perl5Matcher)matchers.get();
       Iterator i=rules.iterator();
       while(i.hasNext()) {
         Rule r=(Rule) i.next();
@@ -77,7 +82,7 @@
     /** Normalizes any URLs by calling super.basicNormalize()
       * and regexSub(). This is the function that gets called
       * elsewhere in Nutch. */
-    public synchronized String normalize(String urlString)
+    public String normalize(String urlString)
       throws MalformedURLException {
         urlString = super.normalize(urlString); // run basicNormalize first to ready for regexNormalize
         urlString = regexNormalize(urlString);
@@ -128,7 +133,8 @@
         }
         if (patternValue != null && subValue != null) {
           Rule rule=new Rule();
-          rule.pattern=(Perl5Pattern) compiler.compile(patternValue);
+          rule.pattern=(Perl5Pattern) 
+              compiler.compile(patternValue,Perl5Compiler.READ_ONLY_MASK);
           rule.substitution=subValue;
           rules.add(rule);
         }
