[jira] Commented: (NUTCH-416) CrawlDatum status and CrawlDbReducer refactoring

Doug Cook (JIRA) Wed, 20 Dec 2006 14:40:45 -0800

    [ http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460080 ]
            
Doug Cook commented on NUTCH-416:
---------------------------------


You may also want to make the status codes ORed bit values so that, for 
example, every kind of failure has a FAILED bit ORed in. That makes it clean 
and easy in the code to check for "any failure case" while still allowing 
distinct failure codes. At the lowest level, the values might be things like 
FAILED, FETCHED, and UNFETCHED, while REDIRECT might be (FETCHED | something), 
specific redirect codes would be (REDIRECT | something), specific failure 
codes would be (FAILED | something), etc. This way we can keep all of the 
specific failure codes, all of the specific redirect codes, etc. while making 
the code cleaner and more reliable. We won't have to worry about keeping range 
checks or switch statements in sync if we add new codes; a statement like
   if ((code & FAILED) != 0) {
   }
will always tell us whether a URL fetch failed, regardless of how many codes we 
add. As the code stands now, adding a status code is likely to break things 
unless one carefully goes through every place where status codes are examined 
and makes sure the new code is properly accounted for.
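
To make the idea concrete, here is a rough sketch of what such a layout could 
look like. All of the names and bit values below (StatusFlags, REDIRECT_PERM, 
FAILED_TIMEOUT, the use of an int rather than whatever type CrawlDatum 
actually stores, and so on) are made up for illustration and are not the 
existing CrawlDatum constants. Each base category gets its own bit, and a 
specific code ORs its category with a distinguishing value in the higher bits.

```java
public final class StatusFlags {
    // Hypothetical base categories, one bit each.
    public static final int UNFETCHED = 0x01;
    public static final int FETCHED   = 0x02;
    public static final int FAILED    = 0x04;

    // A redirect is a kind of fetch result, so it carries the FETCHED bit too.
    public static final int REDIRECT  = FETCHED | 0x08;

    // Hypothetical specific codes: category bits plus a value in the next byte.
    public static final int REDIRECT_PERM    = REDIRECT | 0x100;
    public static final int REDIRECT_TEMP    = REDIRECT | 0x200;
    public static final int FAILED_TIMEOUT   = FAILED | 0x100;
    public static final int FAILED_NOT_FOUND = FAILED | 0x200;

    private StatusFlags() {}

    // True for FAILED itself and for every code that ORs FAILED in.
    public static boolean isFailure(int code) {
        return (code & FAILED) != 0;
    }

    // REDIRECT spans two bits, so require all of them to be set.
    public static boolean isRedirect(int code) {
        return (code & REDIRECT) == REDIRECT;
    }

    // True for plain FETCHED and for anything built on it, such as redirects.
    public static boolean isFetched(int code) {
        return (code & FETCHED) != 0;
    }

    public static void main(String[] args) {
        System.out.println(isFailure(FAILED_TIMEOUT));  // true
        System.out.println(isFailure(REDIRECT_TEMP));   // false
        System.out.println(isRedirect(REDIRECT_PERM));  // true
        System.out.println(isRedirect(FETCHED));        // false
        System.out.println(isFetched(REDIRECT_PERM));   // true
    }
}
```

With something like this, adding a hypothetical FAILED_DNS = FAILED | 0x300 
later would not require touching isFailure() or any of its callers.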

While you're changing the CrawlDatum, it might also make sense to store a 
second URL, e.g. that of the redirect target. I have a hunch this will be very 
useful.

Just some thoughts. Thanks for making this happen.

Doug



> CrawlDatum status and CrawlDbReducer refactoring
> ------------------------------------------------
>
>                 Key: NUTCH-416
>                 URL: http://issues.apache.org/jira/browse/NUTCH-416
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>
> CrawlDatum needs more status codes, e.g. to reflect redirected pages. 
> However, current values of status codes are linear, which prevents us from 
> adding new codes in proper places. This is also related to the logic in 
> CrawlDbReducer, which makes decisions based on arithmetic ordering of status 
> code values.
> I propose to change the codes so that they are grouped into related values, 
> with significant gaps between groups for adding new codes without causing 
> significant reordering. I also propose to change the logic in CrawlDbReducer 
> so that its operation is not so dependent on actual code values.
> A mapping should also be added between old and new codes to facilitate 
> backward-compatibility of existing data. This mapping should be applied on 
> the fly, without requiring explicit data conversion.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
