Anton Potehin wrote:
1. We have found these flags in CrawlDatum class:
  public static final byte STATUS_SIGNATURE = 0;
  public static final byte STATUS_DB_UNFETCHED = 1;
  public static final byte STATUS_DB_FETCHED = 2;
  public static final byte STATUS_DB_GONE = 3;
  public static final byte STATUS_LINKED = 4;
  public static final byte STATUS_FETCH_SUCCESS = 5;
  public static final byte STATUS_FETCH_RETRY = 6;
  public static final byte STATUS_FETCH_GONE = 7;

Though the names of these flags describe their purpose, it is not
completely clear what they mean. For example, what is the difference
between STATUS_DB_FETCHED and STATUS_FETCH_SUCCESS?

The STATUS_DB_* codes are used in crawldb entries. The STATUS_FETCH_* codes are used in fetcher output. STATUS_LINKED is used in parser output for URLs that are linked to. A crawldb update combines all of these inputs (the old version of the db, plus the fetcher and parser output) to generate a new version of the db that contains only STATUS_DB_* entries. So STATUS_FETCH_SUCCESS is what the fetcher reports for a page, and STATUS_DB_FETCHED is what the db records for it after the update. This logic is in CrawlDbReducer.
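
To make the update step concrete, here is a rough sketch of how the statuses seen for one URL could be folded into a single STATUS_DB_* value. This is not the actual CrawlDbReducer code. The class name CrawlDbUpdateSketch and the merge() helper are made up for illustration, and the mapping of STATUS_FETCH_RETRY back to STATUS_DB_UNFETCHED is an assumption. Only the constants are taken from the CrawlDatum listing quoted above.

```java
class CrawlDbUpdateSketch {
    // Status codes copied from the CrawlDatum listing quoted above.
    static final byte STATUS_SIGNATURE = 0;
    static final byte STATUS_DB_UNFETCHED = 1;
    static final byte STATUS_DB_FETCHED = 2;
    static final byte STATUS_DB_GONE = 3;
    static final byte STATUS_LINKED = 4;
    static final byte STATUS_FETCH_SUCCESS = 5;
    static final byte STATUS_FETCH_RETRY = 6;
    static final byte STATUS_FETCH_GONE = 7;

    // Hypothetical marker meaning "no status of this kind was seen".
    static final byte NONE = -1;

    /**
     * Hypothetical helper: fold every status collected for one URL
     * (old db entry, fetcher output, parser output) into one STATUS_DB_* code.
     */
    static byte merge(byte... statuses) {
        byte oldDb = NONE;  // status from the previous crawldb, if any
        byte fetch = NONE;  // status from fetcher output, if any
        for (byte s : statuses) {
            switch (s) {
                case STATUS_DB_UNFETCHED:
                case STATUS_DB_FETCHED:
                case STATUS_DB_GONE:
                    oldDb = s;
                    break;
                case STATUS_FETCH_SUCCESS:
                case STATUS_FETCH_RETRY:
                case STATUS_FETCH_GONE:
                    fetch = s;
                    break;
                default:
                    // STATUS_LINKED and STATUS_SIGNATURE carry no db state here.
                    break;
            }
        }
        // Fresh fetcher output takes precedence over the old db entry.
        if (fetch == STATUS_FETCH_SUCCESS) return STATUS_DB_FETCHED;
        if (fetch == STATUS_FETCH_GONE) return STATUS_DB_GONE;
        if (fetch == STATUS_FETCH_RETRY) return STATUS_DB_UNFETCHED; // assumption
        // No fetch this round: keep what the db already knew.
        if (oldDb != NONE) return oldDb;
        // Only seen as a link target: a new, not yet fetched URL.
        return STATUS_DB_UNFETCHED;
    }

    public static void main(String[] args) {
        // A known URL that was fetched successfully in this round.
        System.out.println(merge(STATUS_DB_UNFETCHED, STATUS_FETCH_SUCCESS));
        // A URL discovered only through a link in parser output.
        System.out.println(merge(STATUS_LINKED));
        // An already fetched URL that is linked to again.
        System.out.println(merge(STATUS_DB_FETCHED, STATUS_LINKED));
    }
}
```

Whatever mix of codes goes in, only a STATUS_DB_* code comes out, which is the property described above for the new version of the db.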

Does that help?

Doug
