Andrzej Bialecki wrote:
Mehmet Tan wrote:
   Hi,
I want to ask a question about redirections. Correct me if I'm wrong,
but if a page is redirected to a page that is already in the webdb, then the
next updatedb operation will overwrite all previous refetch info for it,
because the fetcher creates a new page entry whose fetchInterval is the initial fetch interval. How does the adaptive refetch algorithm handle this situation?

Yes, this is a bug, and it affects both the original and the patched versions. The fetch interval shouldn't be blindly copied from any new CrawlDatum (this happens in CrawlDbReducer.java:86 in both versions); instead, it should be initialized with the value from old.getFetchInterval(), if available. Please fix this in your version; I'll fix it in the un-patched version.

Thanks for spotting this!


Please check the attached patch. It should properly copy all original values first, and then update only those that need to change.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Index: CrawlDbReducer.java
===================================================================
--- CrawlDbReducer.java (revision 389791)
+++ CrawlDbReducer.java (working copy)
@@ -61,38 +61,38 @@
       }
     }
 
-    CrawlDatum result = null;
+    CrawlDatum result = new CrawlDatum();
+    // initialize with previous values, also copy metadata from old
+    // and overlay them with new metadata
+    if (old != null) {
+      result.set(old);
+      result.getMetaData().putAll(highest.getMetaData());
+    } else {
+      result.set(highest);
+    }
 
     switch (highest.getStatus()) {                // determine new status
 
     case CrawlDatum.STATUS_DB_UNFETCHED:          // no new entry
     case CrawlDatum.STATUS_DB_FETCHED:
     case CrawlDatum.STATUS_DB_GONE:
-      result = old;                               // use old
+      // use old
+      result = old;
       break;
 
     case CrawlDatum.STATUS_LINKED:                // highest was link
-      if (old != null) {                          // if old exists
-        result = old;                             // use it
-      } else {
-        result = highest;                         // use new entry
+      if (old == null) {
         result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
-        result.setScore(1.0f);                    // initial score is 1.0f
       }
-      result.setSignature(null);                  // reset the signature
       break;
       
     case CrawlDatum.STATUS_FETCH_SUCCESS:         // succesful fetch
-      result = highest;                           // use new entry
-      if (highest.getSignature() == null) highest.setSignature(signature);
+      if (highest.getSignature() == null) result.setSignature(signature);
       result.setStatus(CrawlDatum.STATUS_DB_FETCHED);
       result.setNextFetchTime();
       break;
 
     case CrawlDatum.STATUS_FETCH_RETRY:           // temporary failure
-      result = highest;                           // use new entry
-      if (old != null)
-        result.setSignature(old.getSignature());  // use old signature
       if (highest.getRetriesSinceFetch() < retryMax) {
         result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
       } else {
@@ -101,9 +101,6 @@
       break;
 
     case CrawlDatum.STATUS_FETCH_GONE:            // permanent failure
-      result = highest;                           // use new entry
-      if (old != null)
-        result.setSignature(old.getSignature());  // use old signature
       result.setStatus(CrawlDatum.STATUS_DB_GONE);
       break;
 
@@ -111,10 +108,8 @@
       throw new RuntimeException("Unknown status: "+highest.getStatus());
     }
     
-    if (result != null) {
-      result.setScore(result.getScore() + scoreIncrement);
-      output.collect(key, result);
-    }
+    result.setScore(result.getScore() + scoreIncrement);
+    output.collect(key, result);
   }
 
 }
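
The copy-then-overlay step at the top of the patch can be illustrated in isolation. The following is only a minimal sketch, not Nutch code: `Datum` is a hypothetical stand-in for CrawlDatum with just three fields, and `merge` is a hypothetical helper that mirrors the new initialization of `result` (start from the old entry if there is one, overlay the new metadata, otherwise start from the new entry).

```java
import java.util.HashMap;
import java.util.Map;

public class MergeSketch {

    /** Hypothetical stand-in for CrawlDatum; holds only what the sketch needs. */
    static class Datum {
        float fetchInterval;
        float score;
        Map<String, String> metaData = new HashMap<>();

        /** Copy every field from another entry (metadata is copied, not shared). */
        void set(Datum other) {
            this.fetchInterval = other.fetchInterval;
            this.score = other.score;
            this.metaData = new HashMap<>(other.metaData);
        }
    }

    /**
     * Hypothetical helper: build the result from the old entry when one exists,
     * overlaying only the metadata of the new entry; otherwise use the new entry.
     */
    static Datum merge(Datum old, Datum highest) {
        Datum result = new Datum();
        if (old != null) {
            result.set(old);                            // keep previous values
            result.metaData.putAll(highest.metaData);   // new metadata wins on conflict
        } else {
            result.set(highest);                        // nothing to preserve
        }
        return result;
    }

    public static void main(String[] args) {
        Datum old = new Datum();
        old.fetchInterval = 7.0f;                       // interval tuned over earlier fetches
        old.metaData.put("source", "old");

        Datum redirected = new Datum();
        redirected.fetchInterval = 30.0f;               // freshly created entry with the initial interval
        redirected.metaData.put("source", "new");

        Datum merged = merge(old, redirected);
        System.out.println("interval=" + merged.fetchInterval
                + " source=" + merged.metaData.get("source"));
    }
}
```

With this ordering, an entry created for a redirect target no longer resets the interval of a page already in the db; only the fields that each status case sets explicitly afterwards are changed.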
