Anton Potehin wrote:
1. We have found these flags in the CrawlDatum class:
public static final byte STATUS_SIGNATURE = 0;
public static final byte STATUS_DB_UNFETCHED = 1;
public static final byte STATUS_DB_FETCHED = 2;
public static final byte STATUS_DB_GONE = 3;
public static final byte STATUS_LINKED = 4;
public static final byte STATUS_FETCH_SUCCESS = 5;
public static final byte STATUS_FETCH_RETRY = 6;
public static final byte STATUS_FETCH_GONE = 7;
Though the names of these flags hint at their purpose, it is not
entirely clear what they mean or, for example, what the difference is
between STATUS_DB_FETCHED and STATUS_FETCH_SUCCESS.
The STATUS_DB_* codes are used in crawldb entries. The
STATUS_FETCH_* codes are used in fetcher output. STATUS_LINKED is used
in parser output for URLs that are linked to. A crawldb update combines
all of these (the old version of the db, plus fetcher and parser output)
to generate a new version of the db, containing only STATUS_DB_*
entries. This logic is in CrawlDbReducer.
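To make the update step concrete, here is a minimal sketch of how the
statuses seen for a single URL might collapse into one STATUS_DB_* code.
It is not the CrawlDbReducer source. The class name CrawlDbUpdateSketch,
the mergeStatus helper, and the precedence rules (a fetcher result
overrides the old db entry, a retry leaves the URL unfetched, a URL seen
only as a link becomes unfetched) are assumptions made for illustration,
and details such as retry limits and signatures are left out.

```java
// Simplified illustration only; this is NOT the actual CrawlDbReducer code.
final class CrawlDbUpdateSketch {
    // Status codes copied from the CrawlDatum constants quoted above.
    static final byte STATUS_SIGNATURE = 0;
    static final byte STATUS_DB_UNFETCHED = 1;
    static final byte STATUS_DB_FETCHED = 2;
    static final byte STATUS_DB_GONE = 3;
    static final byte STATUS_LINKED = 4;
    static final byte STATUS_FETCH_SUCCESS = 5;
    static final byte STATUS_FETCH_RETRY = 6;
    static final byte STATUS_FETCH_GONE = 7;

    /** True for the codes that may appear in a crawldb entry. */
    static boolean isDbStatus(byte s) {
        return s >= STATUS_DB_UNFETCHED && s <= STATUS_DB_GONE;
    }

    /** True for the codes that appear in fetcher output. */
    static boolean isFetchStatus(byte s) {
        return s >= STATUS_FETCH_SUCCESS && s <= STATUS_FETCH_GONE;
    }

    /**
     * Hypothetical helper: merge every status collected for one URL
     * (old db entry, fetcher output, parser output) into a single
     * STATUS_DB_* code for the new crawldb.
     */
    static byte mergeStatus(byte... statuses) {
        byte old = -1;          // status from the old crawldb, if any
        byte fetch = -1;        // status from fetcher output, if any
        boolean linked = false; // URL was seen as an outlink by the parser

        for (byte s : statuses) {
            if (isDbStatus(s)) {
                old = s;
            } else if (isFetchStatus(s)) {
                fetch = s;
            } else if (s == STATUS_LINKED) {
                linked = true;
            }
            // STATUS_SIGNATURE is ignored in this sketch.
        }

        // Assumed precedence: a fresh fetcher result overrides the old entry.
        if (fetch == STATUS_FETCH_SUCCESS) return STATUS_DB_FETCHED;
        if (fetch == STATUS_FETCH_GONE) return STATUS_DB_GONE;
        if (fetch == STATUS_FETCH_RETRY) return STATUS_DB_UNFETCHED;

        // No fetcher result: keep whatever the old db already knew.
        if (old != -1) return old;

        // Newly discovered URL: enters the db as unfetched.
        if (linked) return STATUS_DB_UNFETCHED;

        throw new IllegalArgumentException("no usable status for this URL");
    }

    public static void main(String[] args) {
        // Old entry was unfetched and the fetcher succeeded.
        System.out.println(mergeStatus(STATUS_DB_UNFETCHED, STATUS_FETCH_SUCCESS));
        // URL only discovered through a link on another page.
        System.out.println(mergeStatus(STATUS_LINKED));
    }
}
```

Whatever mix of codes goes in, only a STATUS_DB_* value comes out,
which is the point of the update: STATUS_FETCH_SUCCESS is what the
fetcher reports, and STATUS_DB_FETCHED is what the crawldb records
afterwards.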
Does that help?
Doug