[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069567#comment-14069567 ]

Alexander Kingson edited comment on NUTCH-1679 at 7/22/14 12:46 AM:
--------------------------------------------------------------------

Hi,

I was suggesting closing the datastore by adding this method

    @Override
    protected void cleanup(Context context) throws IOException,
        InterruptedException {
      // close the Gora datastore once the reducer has processed all keys
      store.close();
    }

to the reducer class.
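
For context, a minimal sketch of how this could sit in the reducer (the class shape and the way the store field is obtained are assumptions for illustration, not the exact Nutch source):

    public class DbUpdateReducer extends
        GoraReducer<UrlWithScore, NutchWritable, String, WebPage> {

      private DataStore<String, WebPage> store;

      @Override
      protected void setup(Context context) throws IOException {
        try {
          // hypothetical: open the same WebPage store the updatedb job uses
          store = StorageUtils.createWebStore(context.getConfiguration(),
              String.class, WebPage.class);
        } catch (ClassNotFoundException e) {
          throw new IOException(e);
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException,
          InterruptedException {
        store.close(); // flush buffered writes and release the connection
      }
    }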

Also, I found another issue: inlinks data is wiped out in subsequent
updatedb runs. To solve this I replaced these lines

    if (page.getInlinks() != null) {
      page.getInlinks().clear();
    }

with

    if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
      page.getInlinks().clear();
    }

This change has been tested in only one case, and I am not sure it solves
the issue permanently. Basically, if skipping the clearing of inlinks data on
some calls to the reduce function does not cause inlinks data to overlap
between keys, then this change solves the issue.
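
To make the placement concrete, here is a rough sketch of the surrounding reduce logic (simplified; it assumes inlinkedScoreData has been collected from the incoming values earlier in reduce, as DbUpdateReducer does):

    // inlinkedScoreData was collected from the incoming NutchWritable values
    if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
      // clear stored inlinks only when this batch actually carries new ones,
      // so a batchId-filtered run does not wipe inlinks it never saw
      page.getInlinks().clear();
    }
    for (ScoreDatum inlink : inlinkedScoreData) {
      page.putToInlinks(new Utf8(inlink.getUrl()), new Utf8(inlink.getAnchor()));
    }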

Thanks.
Alex.



> UpdateDb using batchId, link may override crawled page.
> -------------------------------------------------------
>
>                 Key: NUTCH-1679
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1679
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.2.1
>            Reporter: Tien Nguyen Manh
>            Priority: Critical
>             Fix For: 2.3
>
>         Attachments: NUTCH-1679.patch
>
>
> The problem is in the HBase store; I am not sure about other stores.
> Suppose in the first crawl cycle we crawl link A and get an outlink B.
> In the second cycle we crawl link B, which also has a link pointing to A.
> In the second updatedb we load only page B from the store, and we will add A
> as a new link, overriding A, because updatedb does not know that A already
> exists in the store.
> UpdateDb must be run without a batchId, or we must set additionsAllowed=false.
> Here is the code for a new page:
>       page = new WebPage();
>       schedule.initializeSchedule(url, page);
>       page.setStatus(CrawlStatus.STATUS_UNFETCHED);
>       try {
>         scoringFilters.initialScore(url, page);
>       } catch (ScoringFilterException e) {
>         page.setScore(0.0f);
>       }
> The new page will override the old page's status, score, fetchTime,
> fetchInterval, retries, and metadata[CASH_KEY].
> - I think we can change something here so that the new page updates only one
> column, for example 'link'; if it really is a new page, we can initialize all
> the fields above in the generator.
> - Or we add a checkAndPut operator to the store, so that when adding a new
> page we first check whether it already exists (see the sketch below).
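
A checkAndPut against the raw HBase client API could look roughly like this (a sketch only: the column family and qualifier are placeholders, and Gora would have to expose such an operation to the store layer):

    // hypothetical: apply the Put only if the checked cell does not exist yet;
    // passing null as the expected value means the cell must be absent
    byte[] row = Bytes.toBytes(reversedUrl);
    Put put = new Put(row);
    put.add(Bytes.toBytes("f"), Bytes.toBytes("st"),
        new byte[] { CrawlStatus.STATUS_UNFETCHED });
    boolean isNew = table.checkAndPut(row, Bytes.toBytes("f"),
        Bytes.toBytes("st"), null, put);
    // if isNew is false the page already existed, so the stored status,
    // score, fetchTime, fetchInterval, retries and metadata stay untouched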


