[ http://issues.apache.org/jira/browse/NUTCH-416?page=comments#action_12460091 ]

Andrzej Bialecki  commented on NUTCH-416:
-----------------------------------------

There are two main groups of status codes, but they do not divide along the
lines of success/failure: they are DB status codes and Fetch status codes.
Additionally, very few bits are available for a bitmask, because the status
needs to fit in a byte.

My in-progress patch currently contains the following:

  public static final byte STATUS_DB_UNFETCHED      = 0x01;
  public static final byte STATUS_DB_FETCHED        = 0x02;
  public static final byte STATUS_DB_GONE           = 0x03;
  public static final byte STATUS_DB_REDIR_TEMP     = 0x04;
  public static final byte STATUS_DB_REDIR_PERM     = 0x05;
  
  /** Maximum value of DB-related status. */
  public static final byte STATUS_DB_MAX            = 0x1f;
  
  public static final byte STATUS_FETCH_SUCCESS     = 0x21;
  public static final byte STATUS_FETCH_RETRY       = 0x22;
  public static final byte STATUS_FETCH_REDIR_TEMP  = 0x23;
  public static final byte STATUS_FETCH_REDIR_PERM  = 0x24;
  public static final byte STATUS_FETCH_GONE        = 0x25;
  
  /** Maximum value of fetch-related status. */
  public static final byte STATUS_FETCH_MAX         = 0x3f;
  
  public static final byte STATUS_SIGNATURE         = 0x41;
  public static final byte STATUS_INJECTED          = 0x42;
  public static final byte STATUS_LINKED            = 0x43;
  
  /** True if the datum carries a DB-related status (value up to STATUS_DB_MAX). */
  public static boolean hasDbStatus(CrawlDatum datum) {
    return datum.status <= STATUS_DB_MAX;
  }

  /** True if the datum carries a fetch-related status (above DB, up to STATUS_FETCH_MAX). */
  public static boolean hasFetchStatus(CrawlDatum datum) {
    return datum.status > STATUS_DB_MAX && datum.status <= STATUS_FETCH_MAX;
  }

... so, I went with ranges of values. The most unwieldy switch() statements in
the current code were the ones that check whether a status is a DB status or a
Fetch status. The two static methods above handle this check and simplify the
code.
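
To illustrate how the range checks can replace a long switch(), here is a
minimal, self-contained sketch. It works on a raw byte instead of a CrawlDatum,
and the class name StatusRanges and the describe() helper are hypothetical
names used only for this example; only the two boundary constants and the two
range checks come from the patch excerpt above.

```java
// Illustrative sketch only; StatusRanges and describe() are hypothetical names.
final class StatusRanges {
  // Boundary values taken from the patch excerpt.
  static final byte STATUS_DB_MAX    = 0x1f;
  static final byte STATUS_FETCH_MAX = 0x3f;

  // Same comparison as hasDbStatus(CrawlDatum), applied to a raw byte.
  static boolean hasDbStatus(byte status) {
    return status <= STATUS_DB_MAX;
  }

  // Same comparison as hasFetchStatus(CrawlDatum), applied to a raw byte.
  static boolean hasFetchStatus(byte status) {
    return status > STATUS_DB_MAX && status <= STATUS_FETCH_MAX;
  }

  // Classifies a status by group without enumerating every code in a switch.
  static String describe(byte status) {
    if (hasDbStatus(status)) return "db";
    if (hasFetchStatus(status)) return "fetch";
    return "other";
  }
}
```

With this shape, adding a new code inside an existing range does not require
touching any of the group checks.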

Regarding the redirect URL: because of space constraints I'd rather use
Metadata for this. We already handle metadata efficiently, so performance
doesn't suffer when a datum has no metadata to keep. It would make sense,
though, to have a predefined key for this URL.
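
As a rough sketch of the idea, the redirect URL could be stored under a
well-known metadata key, with the metadata map allocated only when needed.
Everything here is an assumption made for illustration: DatumSketch is a
hypothetical stand-in for CrawlDatum, a plain java.util.Map stands in for the
real metadata container, and the key name "redirect.url" is invented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum; not the actual Nutch class.
final class DatumSketch {
  /** Hypothetical predefined key; the real key name is not given in this comment. */
  static final String REDIRECT_URL_KEY = "redirect.url";

  // Allocated lazily, so a datum without metadata carries no extra map.
  private Map<String, String> metaData;

  // Stores the redirect target under the predefined key.
  void setRedirectUrl(String url) {
    if (metaData == null) metaData = new HashMap<>();
    metaData.put(REDIRECT_URL_KEY, url);
  }

  // Returns the redirect target, or null if none was recorded.
  String getRedirectUrl() {
    return metaData == null ? null : metaData.get(REDIRECT_URL_KEY);
  }

  // True only when at least one metadata entry exists.
  boolean hasMetaData() {
    return metaData != null && !metaData.isEmpty();
  }
}
```

A predefined key means every component that reads or writes the redirect URL
agrees on where to find it, without adding a field to the datum itself.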

> CrawlDatum status and CrawlDbReducer refactoring
> ------------------------------------------------
>
>                 Key: NUTCH-416
>                 URL: http://issues.apache.org/jira/browse/NUTCH-416
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>
> CrawlDatum needs more status codes, e.g. to reflect redirected pages. 
> However, current values of status codes are linear, which prevents us from 
> adding new codes in proper places. This is also related to the logic in 
> CrawlDbReducer, which makes decisions based on arithmetic ordering of status 
> code values.
> I propose to change the codes so that they are grouped into related values, 
> with significant gaps between groups for adding new codes without causing 
> significant reordering. I also propose to change the logic in CrawlDbReducer 
> so that its operation is not so dependent on actual code values.
> A mapping should also be added between old and new codes to facilitate 
> backward-compatibility of existing data. This mapping should be applied on 
> the fly, without requiring explicit data conversion.
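
The on-the-fly mapping proposed in the issue description could look roughly
like the sketch below. The old code values, the version threshold and the names
StatusUpgrade, CURRENT_VERSION and readStatus are all assumptions made for
illustration; only the new values come from the patch excerpt in the comment.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and old values are assumptions.
final class StatusUpgrade {
  // Assumed pre-refactoring (linear) values mapped to the new grouped values.
  private static final Map<Byte, Byte> OLD_TO_NEW = new HashMap<>();
  static {
    OLD_TO_NEW.put((byte) 0, (byte) 0x41); // signature
    OLD_TO_NEW.put((byte) 1, (byte) 0x01); // db unfetched
    OLD_TO_NEW.put((byte) 2, (byte) 0x02); // db fetched
    OLD_TO_NEW.put((byte) 3, (byte) 0x03); // db gone
    OLD_TO_NEW.put((byte) 4, (byte) 0x43); // linked
    OLD_TO_NEW.put((byte) 5, (byte) 0x21); // fetch success
    OLD_TO_NEW.put((byte) 6, (byte) 0x22); // fetch retry
    OLD_TO_NEW.put((byte) 7, (byte) 0x25); // fetch gone
  }

  // Hypothetical record version at which the new codes take effect.
  static final byte CURRENT_VERSION = 5;

  // Translates a status while reading, so stored data needs no conversion pass.
  static byte readStatus(byte recordVersion, byte rawStatus) {
    if (recordVersion >= CURRENT_VERSION) return rawStatus; // already new-style
    Byte mapped = OLD_TO_NEW.get(rawStatus);
    if (mapped == null) {
      throw new IllegalArgumentException("Unknown old status: " + rawStatus);
    }
    return mapped;
  }
}
```

Because the translation happens at read time, records written by an older
version remain usable without an explicit conversion step.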
