Re: CrawlDbReducer - selecting data for DB update

2006-04-07 Thread Doug Cutting

Andrzej Bialecki wrote:
This selection is primarily made in the while() loop in 
CrawlDbReducer:45. My main objection is that selecting the "highest" 
value (meaning "most recent") relies on the fact that values of status 
codes in CrawlDatum are ordered according to their meaning, and they are 
treated as a sort of state machine.


Yes, that was the design, that status codes are also priorities.

However, adding new states is very 
difficult if they should have values lower than STATUS_FETCH_GONE, as 
it breaks backwards-compatibility with older segment data. 


We can use CrawlDatum.VERSION to insert new status codes 
back-compatibly.  Perhaps we should change the codes from [0, 1, 2, ...] 
to [0, 10, 20, 30, ...] so that we can more easily introduce new values? 
To update status codes from older versions we simply multiply by 10.


Would something like that work?
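A minimal sketch of the multiply-by-10 idea (the class, method, and version constant here are hypothetical, not the actual CrawlDatum API): when a datum was serialized with an older VERSION, remap its status on read, leaving gaps between codes for future states.

```java
// Hypothetical sketch: spread status codes out by a factor of ten so new
// states can slot in between existing ones, and upgrade codes written by
// older versions when they are read back.
public class StatusUpgrade {
  // Assumed version number at which the spaced-out numbering is introduced.
  static final int SPACED_CODES_VERSION = 5;

  /** Remap a status code read from a datum written by an older version. */
  static byte upgradeStatus(int writtenVersion, byte status) {
    if (writtenVersion < SPACED_CODES_VERSION) {
      return (byte) (status * 10);   // e.g. old code 3 becomes 30
    }
    return status;                   // already in the new numbering
  }
}
```

With this, a new state can later be assigned, say, 25, between two existing codes, without renumbering anything already written.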

Or we could have a separate table mapping status codes to priority.
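The separate-table alternative could look like this (the code values and priorities below are illustrative, not the real Nutch constants): the reducer compares datums by a looked-up priority rather than by the raw status byte, so new codes can be added anywhere in the numbering.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: decouple a status code's numeric value from its selection
// priority, so "pick the highest" no longer constrains how new codes
// are numbered. All values here are made up for illustration.
public class StatusPriority {
  static final Map<Byte, Integer> PRIORITY = new HashMap<>();
  static {
    PRIORITY.put((byte) 1, 10);  // an "unfetched" kind of state
    PRIORITY.put((byte) 2, 30);  // a "fetched" kind of state
    PRIORITY.put((byte) 3, 20);  // a "gone" kind of state
    PRIORITY.put((byte) 8, 25);  // a later addition slots in freely
  }

  static int priority(byte status) {
    return PRIORITY.getOrDefault(status, 0);
  }

  /** Keep whichever of two statuses has the higher priority. */
  static byte select(byte a, byte b) {
    return priority(a) >= priority(b) ? a : b;
  }
}
```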

Doug


CrawlDbReducer - selecting data for DB update

2006-04-07 Thread Andrzej Bialecki

Hi,

The more I look at CrawlDbReducer the less I like the method it uses to 
select the most recent records.


This selection is primarily made in the while() loop in 
CrawlDbReducer:45. My main objection is that selecting the "highest" 
value (meaning "most recent") relies on the fact that values of status 
codes in CrawlDatum are ordered according to their meaning, and they are 
treated as a sort of state machine. However, adding new states is very 
difficult if they should have values lower than STATUS_FETCH_GONE, as 
it breaks backwards-compatibility with older segment data. Adding 
status codes with higher values may also break things here, because a 
CrawlDatum with the highest code would not necessarily be the most 
recent.
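The selection the loop effectively performs can be sketched as follows (heavily simplified; the real reducer also merges scores and metadata): it keeps the datum with the numerically highest status, which is only correct as long as code order matches state order.

```java
import java.util.Iterator;

// Simplified sketch of the selection in CrawlDbReducer: scan all values
// collected for a URL and keep the numerically highest status code,
// relying on codes being ordered by "how far along" the page is.
public class HighestStatus {
  static byte selectHighest(Iterator<Byte> statuses) {
    byte highest = -1;
    while (statuses.hasNext()) {
      byte s = statuses.next();
      if (s > highest) {
        highest = s;   // breaks if a newer state gets a lower code
      }
    }
    return highest;
  }
}
```

This is exactly why a new state inserted below STATUS_FETCH_GONE, or appended above it, can silently lose to an older record.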


I encountered this problem first when adding the signature framework, 
fortunately there was one unused value (0) at that time, so I could add 
CrawlDatum.STATUS_SIGNATURE without breaking the assumptions in 
CrawlDbReducer.


However, now things become more difficult:

* we need another status code for pages newly discovered as a result of 
redirection (see the thread on "Meta-refresh"). If we add this status as 
e.g. STATUS_FETCH_REDIRECT = 8, then the logic in CrawlDbReducer will 
break.


* we need something to mark pages as "being on a fetchlist, to be 
updated soon" (this is to support multiple parallel 
generate/fetch/update cycles). A new status code would do fine for this 
purpose (although we need an expiry timer for that too). Arguably, we 
could use the same trick that we used in 0.7 (moving next fetch time 1 
week into the future), but I'm not sure yet how it would play with the 
adaptive fetch patches, which manipulate this value too...


I could use a hack in the meantime: status values are for now all below 
16, so we could use the upper nibble for these additional flags, and 
mask them out with 0x0f.
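The nibble hack would amount to something like the following (the flag name and value are invented for illustration): extra flags live in the upper four bits, and masking with 0x0f recovers the plain status, assuming all status codes stay below 16.

```java
// Sketch of the interim hack: status codes fit in the low nibble, so the
// upper nibble can carry extra flags; masking with 0x0f recovers the
// plain status. The flag below is hypothetical.
public class NibbleFlags {
  static final byte FLAG_ON_FETCHLIST = (byte) 0x10;  // hypothetical flag

  static byte setFlag(byte status, byte flag) {
    return (byte) (status | flag);
  }

  static byte plainStatus(byte combined) {
    return (byte) (combined & 0x0f);   // mask the flags out
  }

  static boolean hasFlag(byte combined, byte flag) {
    return (combined & flag) != 0;
  }
}
```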


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com