RE: Need some directions

Markus Jelsma Fri, 31 Aug 2012 06:48:48 -0700

 
-----Original message-----
> From:Vijith <vijithkv...@gmail.com>
> Sent: Fri 31-Aug-2012 15:44
> To: dev@nutch.apache.org
> Subject: Re: Need some directions
> 
> I have tried running Nutch on a sample site with two different URLs 
> redirecting to a common resource.
> I could not find any clues in hadoop.log that the common resource is 
> parsed multiple times.
> Could someone please explain the exact scenario that creates this bug?

In the Jira comment you said it fetched page4 twice now.

> 
> And how does this bug relate to NUTCH-1184? 

It relates to 1184 because if URLs in the same fetch list link to a common 
page, it can be followed as well.

We solved this issue by keeping a list of crawled URLs in an external Bloom 
filter.
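
To illustrate the idea, here is a minimal sketch of a Bloom filter that remembers crawled URLs. It is not the implementation referred to above and is not part of Nutch: the class name SeenUrlFilter, its methods, the FNV-1a based hashing and the sizing are all assumptions made for this example, and it keeps the bits in memory rather than in an external store.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical in-memory Bloom filter for already-crawled URLs (illustration only). */
public class SeenUrlFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SeenUrlFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // 32-bit FNV-1a with a seed XORed into the offset basis.
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: the i-th bit position is derived from two base hashes.
    private int index(byte[] data, int i) {
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9747b28c);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** False means the URL was definitely never added; true may be a false positive. */
    public boolean mightContain(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }

    /** Records the URL by setting all of its bit positions. */
    public void add(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(data, i));
        }
    }

    /** Returns true and records the URL if it looks new, false if it was probably seen. */
    public boolean markIfNew(String url) {
        if (mightContain(url)) {
            return false;
        }
        add(url);
        return true;
    }
}
```

A fetcher could call markIfNew(url) before following a redirect or outlink and skip the URL when it returns false. The trade-off is that a Bloom filter never misses a URL it has recorded, but it can occasionally report an unseen URL as seen, so a small fraction of pages may be skipped.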

> 
> On Thu, Aug 30, 2012 at 11:44 AM, Vijith <vijithkv...@gmail.com 
> <mailto:vijithkv...@gmail.com> > wrote:
> Hi all, 
> 
> I am new to dev... I am working on NUTCH-1150...
> I would like to get some directions before I can start... Right now I am 
> going through the Fetcher.java code...
> 
> -- 
> . . . . . thanks & regards
> 
> Vijith V.
