[ http://issues.apache.org/jira/browse/NUTCH-416?page=all ]

Andrzej Bialecki closed NUTCH-416.
-----------------------------------

    Resolution: Fixed

Fixed in trunk, rev. 490607. As a side effect, CrawlDB can now be updated
correctly from multiple segments even if they contain duplicate pages: the
code in CrawlDbReducer applies only the latest version of each CrawlDatum.
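
To illustrate the "latest version wins" behaviour described above, here is a
minimal sketch. It is not the actual CrawlDbReducer code: PageRecord and
LatestRecordSelector are hypothetical names, and it assumes that "latest"
means the record with the greatest fetch time.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for a crawl record; NOT Nutch's CrawlDatum.
final class PageRecord {
    final String url;
    final long fetchTime; // assumption: fetch time in epoch milliseconds
    final String status;

    PageRecord(String url, long fetchTime, String status) {
        this.url = url;
        this.fetchTime = fetchTime;
        this.status = status;
    }
}

// Hypothetical helper: picks one record out of several duplicates of a page.
final class LatestRecordSelector {
    // Returns the record with the greatest fetchTime, or null for an empty list.
    static PageRecord latest(List<PageRecord> records) {
        PageRecord best = null;
        for (PageRecord r : records) {
            if (best == null || r.fetchTime > best.fetchTime) {
                best = r;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The same page seen in three segments, in arbitrary order.
        List<PageRecord> duplicates = Arrays.asList(
            new PageRecord("http://example.com/", 100L, "fetched"),
            new PageRecord("http://example.com/", 300L, "gone"),
            new PageRecord("http://example.com/", 200L, "fetched"));
        PageRecord winner = LatestRecordSelector.latest(duplicates);
        System.out.println(winner.status + " @ " + winner.fetchTime);
    }
}
```

Selecting by timestamp rather than by arrival order means the result does not
depend on the order in which the segments are processed.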

> CrawlDatum status and CrawlDbReducer refactoring
> ------------------------------------------------
>
>                 Key: NUTCH-416
>                 URL: http://issues.apache.org/jira/browse/NUTCH-416
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>
> CrawlDatum needs more status codes, e.g. to reflect redirected pages.
> However, the current status code values form a single linear sequence,
> which prevents us from adding new codes in their proper places. This is
> also related to the logic in CrawlDbReducer, which makes decisions based
> on the arithmetic ordering of status code values.
>
> I propose to change the codes so that they are grouped into related
> values, with significant gaps between groups so that new codes can be
> added without causing significant reordering. I also propose to change
> the logic in CrawlDbReducer so that its operation depends less on the
> actual code values.
>
> A mapping between old and new codes should also be added to keep existing
> data backward-compatible. This mapping should be applied on the fly,
> without requiring explicit data conversion.
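
The proposal quoted above (grouped codes with gaps, plus an on-the-fly mapping
from old codes) could look roughly like the sketch below. All constant names
and numeric values are hypothetical and chosen only for illustration; they are
not taken from the committed change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and values are assumptions, not Nutch's constants.
final class StatusCodes {
    // Assumed legacy codes: one linear sequence with no room for insertions.
    static final byte OLD_DB_UNFETCHED = 1;
    static final byte OLD_DB_FETCHED = 2;
    static final byte OLD_DB_GONE = 3;
    static final byte OLD_FETCH_SUCCESS = 5;

    // Assumed new codes: one range per group, with unused values in between.
    static final byte DB_UNFETCHED = 0x01;
    static final byte DB_FETCHED = 0x02;
    static final byte DB_GONE = 0x03;
    static final byte DB_MAX = 0x1f;        // upper bound of the "db" group
    static final byte FETCH_SUCCESS = 0x21;
    static final byte FETCH_MAX = 0x3f;     // upper bound of the "fetch" group

    private static final Map<Byte, Byte> OLD_TO_NEW = new HashMap<>();
    static {
        OLD_TO_NEW.put(OLD_DB_UNFETCHED, DB_UNFETCHED);
        OLD_TO_NEW.put(OLD_DB_FETCHED, DB_FETCHED);
        OLD_TO_NEW.put(OLD_DB_GONE, DB_GONE);
        OLD_TO_NEW.put(OLD_FETCH_SUCCESS, FETCH_SUCCESS);
    }

    // Translates a legacy code while a record is being read, so stored data
    // never needs a separate conversion pass.
    static byte upgrade(byte oldCode) {
        Byte mapped = OLD_TO_NEW.get(oldCode);
        if (mapped == null) {
            throw new IllegalArgumentException("Unknown legacy status code: " + oldCode);
        }
        return mapped;
    }

    // Callers ask which group a code belongs to instead of comparing raw values.
    static boolean isDbStatus(byte code) {
        return code > 0 && code <= DB_MAX;
    }

    static boolean isFetchStatus(byte code) {
        return code > DB_MAX && code <= FETCH_MAX;
    }

    public static void main(String[] args) {
        byte upgraded = upgrade(OLD_FETCH_SUCCESS);
        System.out.println(isFetchStatus(upgraded));
    }
}
```

With group predicates such as isDbStatus and isFetchStatus, a new code can be
added inside a group's range without changing any caller that only needs to
know the group.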

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira