Hi,

The more I look at CrawlDbReducer the less I like the method it uses to select the most recent records.

This selection is primarily made in the while() loop in CrawlDbReducer:45. My main objection is that selecting the "highest" value (meaning "most recent") relies on the fact that the values of status codes in CrawlDatum are ordered according to their meaning, and that they are treated as a sort of state machine. However, adding new states is very difficult if they need values lower than STATUS_FETCH_GONE, because that breaks backwards compatibility with older segment data. Adding status codes with higher values may also break things here, because a CrawlDatum with the highest code would not necessarily be the most recent.
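To make the objection concrete, here is a minimal sketch, not the actual CrawlDbReducer code. The Datum class, the fetchTime field, and the numeric values are assumptions made up for illustration; STATUS_HYPOTHETICAL_NEW stands for any newly added code with a higher value.

```java
import java.util.List;

// Illustrative sketch only: this is not CrawlDatum or CrawlDbReducer.
final class StatusOrderingSketch {
    static final int STATUS_FETCH_GONE = 7;        // assumed value, for illustration
    static final int STATUS_HYPOTHETICAL_NEW = 8;  // hypothetical newly added code

    // Simplified stand-in for a record: a status code plus a timestamp.
    static final class Datum {
        final int status;
        final long fetchTime;

        Datum(int status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }
    }

    // "Highest status wins": correct only while code order mirrors recency.
    static Datum pickByStatus(List<Datum> values) {
        Datum best = null;
        for (Datum d : values) {
            if (best == null || d.status > best.status) {
                best = d;
            }
        }
        return best;
    }

    // Selection by an explicit timestamp, independent of the code values.
    static Datum pickByTime(List<Datum> values) {
        Datum best = null;
        for (Datum d : values) {
            if (best == null || d.fetchTime > best.fetchTime) {
                best = d;
            }
        }
        return best;
    }
}
```

Given an older record carrying the new code 8 and a newer record carrying code 7, pickByStatus returns the older one while pickByTime returns the newer one, which is exactly the mismatch described above.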

I first encountered this problem when adding the signature framework. Fortunately, there was one unused value (0) at that time, so I could add CrawlDatum.STATUS_SIGNATURE without breaking the assumptions in CrawlDbReducer.

However, now things become more difficult:

* we need another status code for pages newly discovered as a result of redirection (see the thread on "Meta-refresh"). If we add this status as e.g. STATUS_FETCH_REDIRECT = 8, then the logic in CrawlDbReducer will break.

* we need something to mark pages as "being on a fetchlist, to be updated soon" (this is to support multiple parallel generate/fetch/update cycles). A new status code would do fine for this purpose (although we need an expiry timer for that too). Arguably, we could use the same trick that we used in 0.7 (moving next fetch time 1 week into the future), but I'm not sure yet how it would play with the adaptive fetch patches, which manipulate this value too...
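For the second item, a rough sketch of what "on a fetchlist, with an expiry timer" could look like. Everything here is hypothetical: the STATUS_ON_FETCHLIST name and value, the markedAtMs timestamp, and the one-week expiry (borrowed from the 0.7 trick) are assumptions, not existing CrawlDatum fields.

```java
// Illustrative sketch only: none of these names exist in CrawlDatum.
final class FetchlistMarkSketch {
    static final int STATUS_ON_FETCHLIST = 9;                      // hypothetical code
    static final long EXPIRY_MS = 7L * 24L * 60L * 60L * 1000L;    // assumed: 1 week

    // A record counts as "locked" (skipped by generate) only while its
    // fetchlist mark is younger than the expiry period.
    static boolean isLocked(int status, long markedAtMs, long nowMs) {
        return status == STATUS_ON_FETCHLIST && (nowMs - markedAtMs) < EXPIRY_MS;
    }
}
```

With the expiry check, a page whose update never arrives becomes eligible for generation again after a week instead of staying marked forever.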

I could use a hack in the meantime: status values are for now all below 128, so we could use the upper nibble for these additional flags and mask them out with 0x0f.
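A minimal sketch of that hack, assuming the status is stored in a single byte. The flag names and bit positions are hypothetical. Note that masking with 0x0f only recovers the base status if the existing codes fit in the lower nibble (i.e. are below 16), and that bit 0x80 is avoided here because it would make a signed Java byte negative.

```java
// Illustrative sketch only: flag names and bit assignments are hypothetical.
final class StatusFlagSketch {
    static final int STATUS_MASK = 0x0f;        // lower nibble: the base status code
    static final int FLAG_ON_FETCHLIST = 0x10;  // hypothetical flag bit
    static final int FLAG_REDIRECT = 0x20;      // hypothetical flag bit

    // Sets a flag bit while leaving the base status untouched.
    static byte withFlag(byte raw, int flag) {
        return (byte) (raw | flag);
    }

    // Masks the flags out, so comparisons see only the original code.
    static int baseStatus(byte raw) {
        return raw & STATUS_MASK;
    }

    static boolean hasFlag(byte raw, int flag) {
        return (raw & flag) != 0;
    }
}
```

Comparing baseStatus() values rather than the raw bytes would keep the existing ordering logic working, while older segment data (with no flag bits set) reads back unchanged.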

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

