[
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711510#comment-13711510
]
Riyaz Shaik edited comment on NUTCH-1457 at 7/17/13 7:34 PM:
-------------------------------------------------------------
Hi Ferdy,
The scenario you mentioned below will not occur:
*although there might be a problem with code that assumes STATUS_FETCHED, for
example the ParserJob: It only processes STATUS_FETCHED entries. There may be
more dependencies.*
In GeneratorReducer we do not put the *??GENERATE_MARK??* on urls whose
*??fetchtime > currentTime??*, so those urls will not be processed in the
Fetcher/Parser jobs.
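A minimal sketch of that rule, only to illustrate the idea. It is not the attached GeneratorReducer code: the class {{GenerateMarkSketch}}, the {{Page}} type, and the methods {{generate}} and {{selectForFetch}} are hypothetical stand-ins, and the assumption that downstream jobs select rows by the mark is taken from the description above rather than from the Nutch source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; these are NOT the real Nutch classes.
class GenerateMarkSketch {

    // Stand-in name for the mark discussed in the comment.
    static final String GENERATE_MARK = "GENERATE_MARK";

    // Minimal stand-in for a stored page row.
    static final class Page {
        final String url;
        final long fetchTime; // next scheduled fetch time, in milliseconds
        final Map<String, String> marks = new HashMap<>();

        Page(String url, long fetchTime) {
            this.url = url;
            this.fetchTime = fetchTime;
        }
    }

    // Reducer-side rule: mark only pages that are already due.
    static void generate(List<Page> pages, long currentTime, String batchId) {
        for (Page page : pages) {
            if (page.fetchTime > currentTime) {
                continue; // not due yet: leave the page unmarked
            }
            page.marks.put(GENERATE_MARK, batchId);
        }
    }

    // Assumption: downstream jobs only pick up pages carrying the mark.
    static List<String> selectForFetch(List<Page> pages, String batchId) {
        List<String> selected = new ArrayList<>();
        for (Page page : pages) {
            if (batchId.equals(page.marks.get(GENERATE_MARK))) {
                selected.add(page.url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        long now = 1_000L;
        List<Page> pages = new ArrayList<>();
        pages.add(new Page("http://example.org/due", 500L));
        pages.add(new Page("http://example.org/later", 2_000L));

        generate(pages, now, "batch-1");
        System.out.println(selectForFetch(pages, "batch-1"));
    }
}
```

In this sketch the page with the later fetch time never receives the mark, so a job that selects rows by mark never sees it, whatever its status is.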
One drawback of this solution (UNSCHEDULED status/mark in GeneratorMapper)
could be that we update a few columns for all the urls (SCHEDULED +
UNSCHEDULED) in HBase from ??GeneratorReducer??, which might reduce
??GeneratorReducer?? performance.
We have made the change you suggested (use a SCHEDULED marker instead of the
UNSCHEDULED status/marker) and added the SCHEDULED marker in
*??GeneratorReducer??*. It is working fine, and it also overcomes the drawback
of our earlier solution.
We will attach the code changes.
Thanks Ferdy.. :)
> Nutch2 Refactor the update process so that fetched items are only processed
> once
> --------------------------------------------------------------------------------
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: CrawlStatus.java, DbUpdateReducer.java,
> GeneratorMapper.java, GeneratorReducer.java
>
>