[ http://issues.apache.org/jira/browse/NUTCH-416?page=all ]
Andrzej Bialecki closed NUTCH-416.
-----------------------------------

    Resolution: Fixed

Fixed in trunk, rev. 490607. As a side effect it is now possible to correctly update CrawlDB from multiple segments, even if they contain duplicate pages - the code in CrawlDbReducer will correctly apply only the latest version of CrawlDatum.

> CrawlDatum status and CrawlDbReducer refactoring
> ------------------------------------------------
>
>                 Key: NUTCH-416
>                 URL: http://issues.apache.org/jira/browse/NUTCH-416
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki
>         Assigned To: Andrzej Bialecki
>             Fix For: 0.9.0
>
>
> CrawlDatum needs more status codes, e.g. to reflect redirected pages.
> However, current values of status codes are linear, which prevents us from
> adding new codes in proper places. This is also related to the logic in
> CrawlDbReducer, which makes decisions based on arithmetic ordering of status
> code values.
> I propose to change the codes so that they are grouped into related values,
> with significant gaps between groups for adding new codes without causing
> significant reordering. I also propose to change the logic in CrawlDbReducer
> so that its operation is not so dependent on actual code values.
> A mapping should also be added between old and new codes to facilitate
> backward-compatibility of existing data. This mapping should be applied on
> the fly, without requiring explicit data conversion.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
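
The grouped status codes, the on-the-fly mapping of old codes, and the "apply only the latest version" behaviour described in this issue can be sketched roughly as follows. This is a minimal illustration and not the code committed in rev. 490607: the class name, the constant names and their numeric values, the version numbers, and the rule of picking the latest datum by fetch time are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: names and values are illustrative, not Nutch's actual constants.
class StatusCodeSketch {
    // Assumed record format versions; old records carry the linear codes.
    static final byte OLD_VERSION = 1;
    static final byte CUR_VERSION = 2;

    // Old, linear codes (assumed values): no room to insert new codes between them.
    static final byte OLD_DB_UNFETCHED = 0;
    static final byte OLD_DB_FETCHED = 1;
    static final byte OLD_DB_GONE = 2;

    // New codes, grouped by purpose, with gaps left for future additions.
    static final byte DB_UNFETCHED = 0x01;
    static final byte DB_FETCHED = 0x02;
    static final byte DB_GONE = 0x03;
    static final byte DB_REDIRECTED = 0x04; // a newly added code fits inside its group
    static final byte DB_MAX = 0x1f;        // upper bound of the "db" group

    static final byte FETCH_SUCCESS = 0x21;
    static final byte FETCH_REDIRECTED = 0x22;
    static final byte FETCH_MAX = 0x3f;     // upper bound of the "fetch" group

    // Applied while reading a record, so stored data never needs explicit conversion.
    static byte toCurrent(byte version, byte status) {
        if (version >= CUR_VERSION) {
            return status; // already uses the grouped codes
        }
        switch (status) {
            case OLD_DB_UNFETCHED: return DB_UNFETCHED;
            case OLD_DB_FETCHED:   return DB_FETCHED;
            case OLD_DB_GONE:      return DB_GONE;
            default:
                throw new IllegalArgumentException("Unknown old status: " + status);
        }
    }

    // Group tests replace comparisons against individual code values.
    static boolean isDbStatus(byte s) { return s > 0 && s <= DB_MAX; }
    static boolean isFetchStatus(byte s) { return s > DB_MAX && s <= FETCH_MAX; }

    // Stripped-down stand-in for CrawlDatum: just a status and a fetch time.
    static final class Datum {
        final byte status;
        final long fetchTime;
        Datum(byte status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }
    }

    // Among the values seen for one URL, keep only the most recent fetch result.
    // Returns null when no value belongs to the fetch group.
    static Datum latestFetch(List<Datum> values) {
        Datum latest = null;
        for (Datum d : values) {
            if (!isFetchStatus(d.status)) {
                continue;
            }
            if (latest == null || d.fetchTime > latest.fetchTime) {
                latest = d;
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // One db entry read from old-format data plus two fetch results
        // for the same page coming from different segments.
        List<Datum> values = Arrays.asList(
            new Datum(toCurrent(OLD_VERSION, OLD_DB_FETCHED), 100L),
            new Datum(FETCH_SUCCESS, 200L),
            new Datum(FETCH_REDIRECTED, 300L));
        Datum latest = latestFetch(values);
        System.out.println("latest fetch status: 0x" + Integer.toHexString(latest.status));
    }
}
```

Because the reducer logic in this sketch asks only which group a code belongs to, inserting a code such as DB_REDIRECTED does not disturb any ordering-based decisions. Keying the conversion on a record version is one way to keep old and new values from being confused when their numeric ranges overlap.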