Andrzej Bialecki wrote:
Mehmet Tan wrote:
Hi,
I want to ask a question about redirections. Correct me if I'm wrong,
but if a page is redirected to a page that is already in the webdb,
then the next updatedb operation will overwrite all previous refetch
information, because the fetcher creates a new page entry whose
fetchInterval is the initial fetch interval. How does the adaptive
refetch algorithm handle this situation?
Yes, this is a bug, and it affects both the original and the patched
versions. The fetch interval shouldn't be blindly copied from any new
CrawlDatum (this happens in CrawlDbReducer.java:86 in both versions);
instead, it should be initialized with the value from
old.getFetchInterval(), if available. Please fix this in your version;
I'll fix it in the un-patched version.
Thanks for spotting this!
Please check the attached patch; it should properly copy all original
values first, and then update only those that need to change.
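To illustrate the idea outside of the reducer, here is a minimal,
self-contained sketch. It is not the actual Nutch code: the Datum class,
the merge() method and the status constants are simplified, hypothetical
stand-ins for CrawlDatum and the reduce logic. It only shows that the
result starts from the old entry when one exists, so the previous fetch
interval survives a redirect to an already-known page.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the reducer logic; not the real Nutch classes.
class CrawlDbMergeSketch {
    // Illustrative status codes only; the real CrawlDatum constants may differ.
    static final int DB_UNFETCHED = 1;
    static final int DB_FETCHED = 2;
    static final int FETCH_SUCCESS = 5;

    /** Minimal stand-in for CrawlDatum, keeping only the fields relevant here. */
    static class Datum {
        int status;
        float fetchInterval; // days between fetches
        float score;
        final Map<String, String> metaData = new HashMap<>();

        Datum() {}

        Datum(int status, float fetchInterval, float score) {
            this.status = status;
            this.fetchInterval = fetchInterval;
            this.score = score;
        }

        /** Copy every field from another datum, including its metadata. */
        void set(Datum other) {
            status = other.status;
            fetchInterval = other.fetchInterval;
            score = other.score;
            metaData.clear();
            metaData.putAll(other.metaData);
        }
    }

    /** Start from the old entry if there is one, then update only what must change. */
    static Datum merge(Datum old, Datum highest) {
        Datum result = new Datum();
        if (old != null) {
            result.set(old);                           // keep previous interval and score
            result.metaData.putAll(highest.metaData);  // overlay new metadata on old
        } else {
            result.set(highest);                       // nothing known yet: take the new entry
        }
        if (highest.status == FETCH_SUCCESS) {
            result.status = DB_FETCHED;                // only the status is taken from the fetch
        }
        return result;
    }

    public static void main(String[] args) {
        Datum old = new Datum(DB_FETCHED, 7.0f, 2.5f);           // interval already adapted
        Datum redirected = new Datum(FETCH_SUCCESS, 30.0f, 1.0f); // fresh entry with initial interval
        Datum merged = merge(old, redirected);
        System.out.println("fetchInterval=" + merged.fetchInterval + " status=" + merged.status);
    }
}
```

With old present, merge() keeps the interval from old rather than the
initial interval carried by the newly created entry; with old == null it
falls back to the new entry, which matches the intent of the patch below.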
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Index: CrawlDbReducer.java
===================================================================
--- CrawlDbReducer.java (revision 389791)
+++ CrawlDbReducer.java (working copy)
@@ -61,38 +61,38 @@
}
}
- CrawlDatum result = null;
+ CrawlDatum result = new CrawlDatum();
+ // initialize with previous values, also copy metadata from old
+ // and overlay them with new metadata
+ if (old != null) {
+ result.set(old);
+ result.getMetaData().putAll(highest.getMetaData());
+ } else {
+ result.set(highest);
+ }
switch (highest.getStatus()) { // determine new status
case CrawlDatum.STATUS_DB_UNFETCHED: // no new entry
case CrawlDatum.STATUS_DB_FETCHED:
case CrawlDatum.STATUS_DB_GONE:
- result = old; // use old
+ // use old
+ result = old;
break;
case CrawlDatum.STATUS_LINKED: // highest was link
- if (old != null) { // if old exists
- result = old; // use it
- } else {
- result = highest; // use new entry
+ if (old == null) {
result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
- result.setScore(1.0f); // initial score is 1.0f
}
- result.setSignature(null); // reset the signature
break;
case CrawlDatum.STATUS_FETCH_SUCCESS: // succesful fetch
- result = highest; // use new entry
- if (highest.getSignature() == null) highest.setSignature(signature);
+ if (highest.getSignature() == null) result.setSignature(signature);
result.setStatus(CrawlDatum.STATUS_DB_FETCHED);
result.setNextFetchTime();
break;
case CrawlDatum.STATUS_FETCH_RETRY: // temporary failure
- result = highest; // use new entry
- if (old != null)
- result.setSignature(old.getSignature()); // use old signature
if (highest.getRetriesSinceFetch() < retryMax) {
result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
} else {
@@ -101,9 +101,6 @@
break;
case CrawlDatum.STATUS_FETCH_GONE: // permanent failure
- result = highest; // use new entry
- if (old != null)
- result.setSignature(old.getSignature()); // use old signature
result.setStatus(CrawlDatum.STATUS_DB_GONE);
break;
@@ -111,10 +108,8 @@
throw new RuntimeException("Unknown status: "+highest.getStatus());
}
- if (result != null) {
- result.setScore(result.getScore() + scoreIncrement);
- output.collect(key, result);
- }
+ result.setScore(result.getScore() + scoreIncrement);
+ output.collect(key, result);
}
}