[ https://issues.apache.org/jira/browse/NUTCH-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney updated NUTCH-596:
--------------------------------

    Priority: Minor  (was: Major)

Reducing priority to minor.

> ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-596
>                 URL: https://issues.apache.org/jira/browse/NUTCH-596
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Emmanuel Joke
>            Assignee: Doğacan Güney
>            Priority: Minor
>         Attachments: NUTCH-596_v1.patch
>
>
> We have two ways to parse content: either within the Fetcher class or
> with the ParseSegment class.
> Fetcher (1 or 2) first checks whether the CrawlDatum status is
> STATUS_FETCH_SUCCESS and parses the content only if it is.
> However, ParseSegment has no such check, so it parses all content
> stored on disk without checking the status.
> So I think we should implement this check. I can see only three solutions:
> - read the status code from the Metadata of the Content object
> - don't store content for a fetch whose CrawlDatum status is not STATUS_FETCH_SUCCESS
> - load the CrawlDatum object in ParseSegment
> What are your thoughts?
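
For illustration, the first proposed solution (reading the status code from the Content metadata before parsing) could look roughly like the sketch below. It is a standalone example and does not use Nutch classes: the class name `ParseStatusCheck`, the metadata key `fetch.status`, and the numeric value given to `STATUS_FETCH_SUCCESS` are all assumptions made for this example, and a plain `Map` stands in for the Content metadata. The attached patch may do this differently.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch: none of these types are Nutch classes.
class ParseStatusCheck {

    // Stand-in for CrawlDatum.STATUS_FETCH_SUCCESS; the value here is an assumption.
    static final int STATUS_FETCH_SUCCESS = 1;

    // Hypothetical metadata key under which the fetcher would record the status.
    static final String FETCH_STATUS_KEY = "fetch.status";

    /** Returns true only when the metadata records a successful fetch. */
    static boolean shouldParse(Map<String, String> contentMetadata) {
        String raw = contentMetadata.get(FETCH_STATUS_KEY);
        if (raw == null) {
            // No recorded status: skip the content (a design assumption).
            return false;
        }
        try {
            return Integer.parseInt(raw.trim()) == STATUS_FETCH_SUCCESS;
        } catch (NumberFormatException e) {
            // Unreadable status: treat it as not successfully fetched.
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> fetched = new HashMap<>();
        fetched.put(FETCH_STATUS_KEY, String.valueOf(STATUS_FETCH_SUCCESS));

        Map<String, String> failed = new HashMap<>();
        failed.put(FETCH_STATUS_KEY, "2"); // any other status code

        System.out.println(shouldParse(fetched)); // true
        System.out.println(shouldParse(failed));  // false
    }
}
```

Skipping content that has no recorded status is the conservative choice in this sketch. Segments written before such a key existed would then never be parsed, so a real change might prefer to parse when the key is absent.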