Andrzej,

Thanks a lot for your response.

> > 1. I have set db.default.fetch.interval to 1 (in nutch-default.xml),
> > but I have noticed that the fetchInterval field in the Page object is
> > being set to current time + 7 days while URL link data is being read
> > from the fetchlist. Can somebody explain why, or am I not reading the
> > code correctly?
>
> Yes. This is required so that you can generate many fetchlists in rapid
> succession (for parallel crawling), without getting the same pages in
> many fetchlists. This is sort of equivalent to setting a flag saying
> "this page is already in another fetchlist, wait 7 days before
> attempting to put it in another fetchlist".
> After you have fetched a segment, and you update the db, this time is
> re-set to the fetchTime + fetchInterval.
>
> > 2. I have modified the code to ignore the fetchInterval value coming
> > from the fetchlist, meaning that fetchInterval stays equal to the
> > initial value - current time. After I run the following commands:
> > fetch, db update, and generate db segments, I get a new fetchlist, but
> > this list doesn't include my original sites, even though their next
> > fetch time should already be in the past. Can somebody help me
> > understand when those URLs will be fetched?
>
> It's difficult to say what changes you have made... I suggest sticking
> with the current code until it makes more sense to you... ;-)
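Andrzej's explanation of the two updates to the fetch time can be sketched in a few lines of Java. This is a hypothetical simplification for illustration only (the class and method names are mine, not Nutch's actual Page/FetchListEntry code): generating a fetchlist pushes the next-fetch time a week ahead as a "taken" marker, and updatedb later re-sets it to fetchTime + fetchInterval.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the scheduling logic described above;
// not the real Nutch implementation.
public class FetchScheduleSketch {
    static final long ONE_WEEK_MS = TimeUnit.DAYS.toMillis(7);

    // When a page is emitted into a fetchlist, its next-fetch time is
    // pushed 7 days ahead so that other fetchlists generated in rapid
    // succession will not pick up the same page.
    static long onGenerate(long nowMs) {
        return nowMs + ONE_WEEK_MS;
    }

    // After the segment is fetched and the db is updated, the next-fetch
    // time is re-set to fetchTime + fetchInterval.
    static long onUpdateDb(long fetchTimeMs, long fetchIntervalMs) {
        return fetchTimeMs + fetchIntervalMs;
    }

    public static void main(String[] args) {
        long now = 0L;
        System.out.println(onGenerate(now));            // 604800000
        System.out.println(onUpdateDb(now, 86400000L)); // 86400000
    }
}
```

This also explains the observation in point 1: the +7 days seen while reading the fetchlist is a temporary placeholder, not the real re-fetch schedule.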
My objective is to learn how to crawl the Web with Nutch: start with an initial set of URLs and keep discovering new pages while re-fetching existing ones (when needed).

This modification was done for test purposes only. I commented out the assignment of a new value to fetchInterval in Page.readFields(). Note that I don't have the code on this computer and am giving the function name from memory.

My assumption was that, having crawled my 3 original URLs and discovered some new URLs, I should see in the next fetchlist my 3 original URLs + the new URLs (subject to the configured urlfilter-regex). I wanted to see my original URLs re-crawled! I didn't find them in the new fetchlist, and this was my question: what am I missing here? Why are those URLs not included in the fetchlist even though their fetch time has already passed?

> > 3. It looks like the fetcher fails to extract links from http://www.eltweb.com.
> > I know that there are some formats (apparently some HTML variations
> > too) that are not supported. Where can I find information on what is
> > currently supported?
>
> This site has a content redirect (using HTML meta tags) on its home
> page, and no other content. In Nutch 0.6 this was not supported; you
> need to get the latest SVN version in order to crawl such sites.

Thanks, I will get the newer version.

> > 4. Some of the outlinks discovered during the fetch (for instance:
> > http://www.webct.com/software/viewpage?name=software_campus_edition or
> > http://v.extreme-dm.com/?login=cguilfor ) are being ignored (not
> > included in the next fetchlist after executing the [generate db
> > segments] command). Is there a known reason for this? Is there some
> > documentation describing supported URL types?
>
> Outlinks discovered during a specific crawl are added to the WebDB (iff
> they pass the urlfilter-regex), and then if they point to pages that
> pass the urlfilter they are included in the next fetchlist. This is
> normal.
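Regarding point 4, one likely culprit worth checking is the URL filter configuration itself: the filter file that ships with Nutch (conf/crawl-urlfilter.txt or conf/regex-urlfilter.txt, depending on how the crawl is run) includes a rule that skips URLs containing certain characters, and both example URLs contain a '?'. An illustrative fragment (the exact default rules may differ between versions):

```
# Skip URLs containing these characters; a rule like this, present in
# the stock filter, rejects both example URLs above because they
# contain '?'.
-[?*!@=]

# Accept anything else
+.
```

Removing or narrowing the `-[?*!@=]` line would let query-string URLs such as these pass the filter and appear in subsequent fetchlists.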
Thanks again, I will learn more about urlfilter-regex.

Regards,

Daniel
