-----Original message-----
> From: Vijith <vijithkv...@gmail.com>
> Sent: Fri 31-Aug-2012 15:44
> To: dev@nutch.apache.org
> Subject: Re: Need some directions
>
> I have tried running Nutch with a sample site with two different URLs
> redirecting to a common resource.
> I could not find any clues in hadoop.log showing where the common resource is
> parsed multiple times.
> Could someone please explain the exact scenario that creates this bug?

In the Jira comment you said it fetched page4 twice now.

> > And how does this bug relate to NUTCH-1184?

It relates to NUTCH-1184 because if URLs in the same fetch list link to a common page, that page can be followed as well. We solved this issue by keeping a list of crawled URLs in an external Bloom filter.

> > On Thu, Aug 30, 2012 at 11:44 AM, Vijith <vijithkv...@gmail.com> wrote:
> Hi all,
>
> I am new to dev... I am working on NUTCH-1150...
> I would like to get some directions before I can start... Right now I am
> going through the Fetcher.java code...
>
> --
> . . . . . thanks & regards
>
> Vijith V.
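
For reference, the idea in the reply above of recording crawled URLs in a Bloom filter could look roughly like the sketch below. This is only an illustration under assumptions: the class name SeenUrlFilter, its methods, and the sizing parameters are hypothetical and are not part of Nutch or of the setup described in the reply. It also runs in-process, whereas the reply mentions an external filter, so it shows the data structure only, not how it was deployed.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical in-memory Bloom filter of URLs already seen; not a Nutch class. */
class SeenUrlFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SeenUrlFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Seeded 32-bit FNV-1a style hash over the URL bytes.
    private static int hash(byte[] data, int seed) {
        int h = 0x811c9dc5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: derive the i-th bit position from two base hashes.
    private int index(byte[] data, int i) {
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9e3779b9);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /**
     * Records the URL and returns true if at least one of its bits was unset,
     * meaning the URL was definitely not recorded before. A false result means
     * "probably seen already" (Bloom filters can give false positives).
     */
    synchronized boolean markIfNew(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        boolean isNew = false;
        for (int i = 0; i < numHashes; i++) {
            int idx = index(data, i);
            if (!bits.get(idx)) {
                isNew = true;
                bits.set(idx);
            }
        }
        return isNew;
    }

    /** Returns true if the URL may have been recorded, false if it certainly was not. */
    synchronized boolean mightContain(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A fetcher thread would call markIfNew(url) before following a redirect or outlink and skip the URL when it returns false. The trade-off is that a false positive causes a page to be skipped that was never actually fetched, so the bit count and number of hashes need to be sized for the expected number of URLs.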