Andrzej Bialecki wrote:
This selection is primarily made in the while() loop in CrawlDbReducer:45. My main objection is that selecting the "highest" value (meaning "most recent") relies on the fact that values of status codes in CrawlDatum are ordered according to their meaning, and they are treated as a sort of state machine.

Yes, that was the design, that status codes are also priorities.

However, adding new states is very difficult, if they should have values lower than STATUS_FETCH_GONE, as it leads to breaking backwards-compatibility with older segment data.

We can use CrawlDatum.VERSION to insert new status codes back-compatibly. Perhaps we should change the codes from [0, 1, 2, ...] to [0, 10, 20, 30, ...] so that we can more easily introduce new values? To update status codes from older versions, we would simply multiply by 10.
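
A rough sketch of what the multiply-by-10 upgrade could look like. The class name, the version numbers, and the upgrade() helper are all made up for illustration; this is not the actual CrawlDatum code, and it assumes the record's version is available when the status is read.

```java
// Hypothetical sketch only: names and version numbers are assumptions,
// not Nutch's real CrawlDatum implementation.
final class StatusCodeUpgrade {
    // Assumed version in which status codes were dense: 0, 1, 2, ...
    static final byte OLD_VERSION = 1;
    // Assumed version that stores codes spaced by 10: 0, 10, 20, ...
    static final byte CUR_VERSION = 2;
    // Gap left between adjacent codes so new ones can be slotted in later.
    static final int SPACING = 10;

    private StatusCodeUpgrade() {}

    /** Converts a status code read from a record written with recordVersion. */
    static int upgrade(byte recordVersion, int status) {
        if (recordVersion > CUR_VERSION) {
            throw new IllegalArgumentException("unknown version: " + recordVersion);
        }
        if (recordVersion < CUR_VERSION) {
            // Old dense code: spread it out onto the new scale.
            return status * SPACING;
        }
        // Already on the spaced scale.
        return status;
    }
}
```

With this, an old code 3 becomes 30, and a later version could add a new state at, say, 25 without renumbering anything that is already on disk.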

Would something like that work?

Or we could have a separate table mapping status codes to priority.
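
The table alternative might look roughly like this. The status names, their values, and the priorities are invented for the example (they are not the real CrawlDatum constants); the point is only that the stored code and its priority no longer have to be the same number.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: status names, values, and priorities are
// assumptions for illustration, not Nutch's real constants.
final class StatusPriority {
    // Codes as stored on disk; existing values never change.
    static final int STATUS_OLD = 1;
    static final int STATUS_NEW = 2;
    // Added in a later release, so it gets the next free code...
    static final int STATUS_ADDED_LATER = 3;

    private static final Map<Integer, Integer> PRIORITY = new HashMap<>();
    static {
        PRIORITY.put(STATUS_OLD, 10);
        // ...but its priority sits between the two older states.
        PRIORITY.put(STATUS_ADDED_LATER, 15);
        PRIORITY.put(STATUS_NEW, 20);
    }

    private StatusPriority() {}

    /** Returns the priority for a status code, failing on unknown codes. */
    static int priorityOf(int status) {
        Integer p = PRIORITY.get(status);
        if (p == null) {
            throw new IllegalArgumentException("unknown status: " + status);
        }
        return p;
    }

    /** Returns whichever of the two statuses has the higher priority. */
    static int pickHigher(int a, int b) {
        return priorityOf(a) >= priorityOf(b) ? a : b;
    }
}
```

A reducer loop would then compare priorityOf(status) instead of the raw status values, and older segment data would stay readable as-is.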

Doug


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
