[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711510#comment-13711510
 ] 

Riyaz Shaik edited comment on NUTCH-1457 at 7/17/13 7:34 PM:
-------------------------------------------------------------

Hi Ferdy,

The scenario you mentioned below will not occur:

 *although there might be a problem with code that assumes STATUS_FETCHED, for 
example the ParserJob: It only processes STATUS_FETCHED entries. There may be 
more dependencies.*

This is because GeneratorReducer does not put the *??GENERATE_MARK??* on urls 
whose *??fetchtime > currentTime??*, so those urls will not be processed by 
the Fetcher/Parser jobs.
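
A minimal sketch of the rule above, not the attached patch: the Page class, the marker key string, and the method names are hypothetical stand-ins rather than the actual Nutch classes. It only illustrates that a url whose fetch time is still in the future never receives the mark, so a later job that selects by mark never sees it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GenerateMarkSketch {
    // Hypothetical stand-in for a web table row; not the real Nutch page class.
    static final class Page {
        final String url;
        final long fetchTime; // next scheduled fetch time
        final Map<String, String> markers = new HashMap<>();

        Page(String url, long fetchTime) {
            this.url = url;
            this.fetchTime = fetchTime;
        }
    }

    // Illustrative marker key; the real key used by Nutch may differ.
    static final String GENERATE_MARK = "GENERATE_MARK";

    // Reducer-side rule: urls with fetchTime > currentTime get no mark.
    static void markIfDue(Page page, long currentTime, String batchId) {
        if (page.fetchTime > currentTime) {
            return; // not due yet, leave the row unmarked
        }
        page.markers.put(GENERATE_MARK, batchId);
    }

    // Fetcher/Parser-side rule (assumed): only marked rows are picked up.
    static List<String> selectForFetch(List<Page> pages) {
        List<String> selected = new ArrayList<>();
        for (Page page : pages) {
            if (page.markers.containsKey(GENERATE_MARK)) {
                selected.add(page.url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        long now = 1_000L;
        List<Page> pages = Arrays.asList(
                new Page("http://a.example/due", 500L),
                new Page("http://b.example/future", 1_500L));
        for (Page page : pages) {
            markIfDue(page, now, "batch-1");
        }
        // Only the url that is already due is selected.
        System.out.println(selectForFetch(pages));
    }
}
```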

One drawback of this solution (UNSCHEDULED status/mark in GeneratorMapper) 
could be that we update a few columns for all the urls (SCHEDULED + 
UNSCHEDULED) in HBase from ??GeneratorReducer??, which might reduce 
??GeneratorReducer?? performance.

We have made the changes you suggested (use a SCHEDULED marker instead of the 
UNSCHEDULED status/marker) and added the SCHEDULED marker in 
*??GeneratorReducer??*. It works fine and also overcomes the drawback of our 
earlier solution.
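
To make the difference concrete, here is a rough sketch that only counts row writes. It assumes the gain comes from the reducer writing the SCHEDULED marker to due rows only, instead of updating every row; the Row class and both method names are hypothetical, and no real HBase or Nutch API is involved.

```java
import java.util.Arrays;
import java.util.List;

public class ReducerWriteCountSketch {
    // Hypothetical row: a url and its next fetch time.
    static final class Row {
        final String url;
        final long fetchTime;

        Row(String url, long fetchTime) {
            this.url = url;
            this.fetchTime = fetchTime;
        }
    }

    // Earlier approach (as described above): every row, SCHEDULED or
    // UNSCHEDULED, gets a column update, so writes equal the row count.
    static int writesWithUnscheduledStatus(List<Row> rows, long currentTime) {
        return rows.size();
    }

    // Assumed new approach: only rows that are due receive the SCHEDULED marker.
    static int writesWithScheduledMarker(List<Row> rows, long currentTime) {
        int writes = 0;
        for (Row row : rows) {
            if (row.fetchTime <= currentTime) {
                writes++;
            }
        }
        return writes;
    }

    public static void main(String[] args) {
        long now = 1_000L;
        List<Row> rows = Arrays.asList(
                new Row("http://a.example/", 500L),
                new Row("http://b.example/", 1_500L),
                new Row("http://c.example/", 2_000L));
        System.out.println("UNSCHEDULED status: " + writesWithUnscheduledStatus(rows, now));
        System.out.println("SCHEDULED marker:   " + writesWithScheduledMarker(rows, now));
    }
}
```

The more urls in the table that are not yet due, the larger the gap between the two counts.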

Will attach the code changes.

Thanks Ferdy.. :)
                
> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-1457
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1457
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ferdy Galema
>             Fix For: 2.4
>
>         Attachments: CrawlStatus.java, DbUpdateReducer.java, 
> GeneratorMapper.java, GeneratorReducer.java
>
>


