Hi,
The more I look at CrawlDbReducer the less I like the method it uses to
select the most recent records.
This selection is primarily made in the while() loop in
CrawlDbReducer:45. My main objection is that selecting the "highest"
value (meaning "most recent") relies on the fact that values of status
codes in CrawlDatum are ordered according to their meaning, and they are
treated as a sort of state machine. However, adding new states is very
difficult if they need values lower than STATUS_FETCH_GONE, as
it breaks backwards-compatibility with older segment data.
Adding status codes with higher values may also break things here,
because a CrawlDatum with the highest code would not necessarily be the
most recent.
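
To illustrate the second point, here is a minimal sketch. It is not the
actual CrawlDbReducer code: the Datum class and its observedAt field are
hypothetical stand-ins for CrawlDatum, and the value 7 for
STATUS_FETCH_GONE is an assumption made for the example. The time-based
picker is there only to show what "most recent" means; it is not a
proposed fix.

```java
import java.util.List;

class StatusSelectionSketch {
    // Hypothetical stand-in for CrawlDatum, reduced to two fields.
    static final class Datum {
        final int status;
        final long observedAt; // hypothetical: when this record was produced

        Datum(int status, long observedAt) {
            this.status = status;
            this.observedAt = observedAt;
        }
    }

    static final int STATUS_FETCH_GONE = 7;     // assumed value, for illustration
    static final int STATUS_FETCH_REDIRECT = 8; // the new code discussed in this message

    // Selection by numeric status: the largest code wins.
    static Datum pickHighestStatus(List<Datum> values) {
        Datum best = null;
        for (Datum d : values) {
            if (best == null || d.status > best.status) {
                best = d;
            }
        }
        return best;
    }

    // For comparison only: the record that really is the most recent.
    static Datum pickMostRecent(List<Datum> values) {
        Datum best = null;
        for (Datum d : values) {
            if (best == null || d.observedAt > best.observedAt) {
                best = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Datum> values = List.of(
            new Datum(STATUS_FETCH_REDIRECT, 1000L), // older record, higher code
            new Datum(STATUS_FETCH_GONE, 2000L));    // newer record, lower code
        System.out.println("highest status picks: " + pickHighestStatus(values).status);
        System.out.println("most recent is:       " + pickMostRecent(values).status);
    }
}
```

With these two records, the status-based selection returns the older
redirect record, even though the "gone" record is the newer one.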
I first encountered this problem when adding the signature framework.
Fortunately, there was one unused value (0) at that time, so I could add
CrawlDatum.STATUS_SIGNATURE without breaking the assumptions in
CrawlDbReducer.
However, now things become more difficult:
* we need another status code for pages newly discovered as a
result of redirection (see the thread on "Meta-refresh"). If we add this
status as e.g. STATUS_FETCH_REDIRECT = 8, then the logic in
CrawlDbReducer will break.
* we need something to mark pages as "being on a fetchlist, to be
updated soon" (this is to support multiple parallel
generate/fetch/update cycles). A new status code would do fine for this
purpose (although we need an expiry timer for that too). Arguably, we
could use the same trick that we used in 0.7 (moving next fetch time 1
week into the future), but I'm not sure yet how it would play with the
adaptive fetch patches, which manipulate this value too...
I could use a hack in the meantime: status values are for now all below
128, so we could use the upper nibble for these additional flags and
mask them out with 0x0f.
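
A rough sketch of that hack, assuming the status is stored in a single
byte. The flag names and their bit positions are hypothetical, chosen
only for illustration. Note that masking with 0x0f recovers the original
code only as long as every base status value fits in the low nibble,
i.e. is below 16.

```java
class StatusFlagsSketch {
    // Low nibble carries the existing status code.
    static final int STATUS_MASK = 0x0f;

    // Hypothetical flags stored in the upper nibble.
    static final int FLAG_REDIRECT = 0x10;
    static final int FLAG_ON_FETCHLIST = 0x20;

    // Returns the status byte with the given flag bit set.
    static byte withFlag(byte status, int flag) {
        return (byte) (status | flag);
    }

    // Strips the flags, leaving the code that the existing comparison expects.
    static int baseStatus(byte status) {
        return status & STATUS_MASK;
    }

    // Tests whether the given flag bit is set.
    static boolean hasFlag(byte status, int flag) {
        return (status & flag) != 0;
    }

    public static void main(String[] args) {
        byte gone = 7; // assumed value of STATUS_FETCH_GONE, for illustration
        byte marked = withFlag(gone, FLAG_ON_FETCHLIST);
        System.out.println("base status:  " + baseStatus(marked));
        System.out.println("on fetchlist: " + hasFlag(marked, FLAG_ON_FETCHLIST));
        System.out.println("redirect:     " + hasFlag(marked, FLAG_REDIRECT));
    }
}
```

The existing ordering logic would then compare baseStatus() values, so
the flags would not disturb the selection of the "highest" record.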
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com