Andrzej,

Thanks a lot for your response.

> > 1. I have set db.default.fetch.interval to 1 (in nutch-default.xml),
> > but I have noticed that the fetchInterval field in the Page object is
> > being set to current time + 7 days while URL link data is being read
> > from the fetchlist. Can somebody explain why, or am I not reading the
> > code correctly?
>
> Yes. This is required so that you can generate many fetchlists in rapid
> succession (for parallel crawling), without getting the same pages in
> many fetchlists. This is sort of equivalent to setting a flag saying
> "this page is already in another fetchlist, wait 7 days before
> attempting to put it in another fetchlist".
> After you have fetched a segment, and you update the db, this time is
> re-set to the fetchTime + fetchInterval.
>
> > 2. I have modified the code to ignore the fetchInterval value coming
> > from the fetchlist, meaning that fetchInterval stays equal to the
> > initial value - current time. After I run the following commands:
> > fetch, db update, and generate db segments, I get a new fetchlist, but
> > this list doesn't include my original sites, even though their next
> > fetch time should already be in the past. Can somebody help me
> > understand when those URLs will be fetched?
>
> It's difficult to say what changes you have made... I suggest sticking
> with the current code until it makes more sense to you... ;-)
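Andrzej's explanation of the two updates to the fetch time can be sketched in a few lines of Java. This is a hypothetical simplification for illustration only (the class and method names are mine, not Nutch's actual Page/FetchListEntry code): generating a fetchlist pushes the next-fetch time a week ahead as a "taken" marker, and updatedb later re-sets it to fetchTime + fetchInterval.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the scheduling logic described above;
// not the real Nutch implementation.
public class FetchScheduleSketch {
    static final long ONE_WEEK_MS = TimeUnit.DAYS.toMillis(7);

    // When a page is emitted into a fetchlist, its next-fetch time is
    // pushed 7 days ahead so that other fetchlists generated in rapid
    // succession will not pick up the same page.
    static long onGenerate(long nowMs) {
        return nowMs + ONE_WEEK_MS;
    }

    // After the segment is fetched and the db is updated, the next-fetch
    // time is re-set to fetchTime + fetchInterval.
    static long onUpdateDb(long fetchTimeMs, long fetchIntervalMs) {
        return fetchTimeMs + fetchIntervalMs;
    }

    public static void main(String[] args) {
        long now = 0L;
        System.out.println(onGenerate(now));            // 604800000
        System.out.println(onUpdateDb(now, 86400000L)); // 86400000
    }
}
```

This also explains the observation in point 1: the +7 days seen while reading the fetchlist is a temporary placeholder, not the real re-fetch schedule.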
My objective is to learn how to crawl the Web with Nutch: start with an initial set of URLs and keep discovering new pages while re-fetching existing ones (when needed).

This modification was done for test purposes only. I commented out the assignment of a new value to fetchInterval in Page.readFields(). Note that I don't have the code on this computer and am giving the function name from memory.

My assumption was that, having crawled my 3 original URLs and discovered some new URLs, I should see in the next fetchlist my 3 original URLs + the new URLs (subject to the configured urlfilter-regex). I wanted to see my original URLs re-crawled! I didn't find them in the new fetchlist, and this was my question: what am I missing here? Why are those URLs not included in the fetchlist even though their fetch time has already passed?

> > 3. It looks like the fetcher fails to extract links from http://www.eltweb.com.
> > I know that there are some formats (apparently some HTML variations
> > too) that are not supported. Where can I find information on what is
> > currently supported?
>
> This site has a content redirect (using HTML meta tags) on its home
> page, and no other content. In Nutch 0.6 this was not supported; you
> need to get the latest SVN version in order to crawl such sites.

Thanks, I will get the newer version.

> > 4. Some of the outlinks discovered during the fetch (for instance:
> > http://www.webct.com/software/viewpage?name=software_campus_edition or
> > http://v.extreme-dm.com/?login=cguilfor ) are being ignored (not
> > included in the next fetchlist after executing the [generate db
> > segments] command). Is there a known reason for this? Is there some
> > documentation describing supported URL types?
>
> Outlinks discovered during a specific crawl are added to the WebDB (iff
> they pass the urlfilter-regex), and then if they point to pages that
> pass the urlfilter they are included in the next fetchlist. This is
> normal.
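Regarding point 4, one likely culprit worth checking is the URL filter configuration itself: the filter file that ships with Nutch (conf/crawl-urlfilter.txt or conf/regex-urlfilter.txt, depending on how the crawl is run) includes a rule that skips URLs containing certain characters, and both example URLs contain a '?'. An illustrative fragment (the exact default rules may differ between versions):

```
# Skip URLs containing these characters; a rule like this, present in
# the stock filter, rejects both example URLs above because they
# contain '?'.
-[?*!@=]

# Accept anything else
+.
```

Removing or narrowing the `-[?*!@=]` line would let query-string URLs such as these pass the filter and appear in subsequent fetchlists.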
Thanks again, I will learn more about urlfilter-regex.

Regards,

Daniel
