Daniel D. wrote:
Hello,

I'm trying to understand how to start with an initial set of URLs and then continue fetching new URLs and re-fetching existing URLs (when they are due to be re-fetched).

I have run some tests in order to understand the software's behavior. Now I have some questions for you guys and would appreciate your help.

1. I have set db.default.fetch.interval to 1 (in nutch-default.xml), but I have noticed that the fetchInterval field in the Page object is being set to the current time + 7 days while the URL link data is being read from the fetchlist. Can somebody explain why, or am I not reading the code correctly?

Yes. This is required so that you can generate many fetchlists in rapid succession (for parallel crawling) without getting the same pages in several of them. It is roughly equivalent to setting a flag that says "this page is already on a fetchlist; wait 7 days before attempting to put it on another one".

After you have fetched a segment and updated the db, this time is reset to fetchTime + fetchInterval.
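To illustrate the two-step scheduling described above, here is a minimal Python sketch. This is not Nutch's actual code; the field names and the 7-day lock window are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Hypothetical lock window: keeps a page out of subsequent fetchlists
# generated before this one has been fetched and the db updated.
LOCK_WINDOW = timedelta(days=7)

def on_generate(page, now):
    """When a page is placed on a fetchlist, push its next-fetch time
    forward so that parallel 'generate' runs will not select it again."""
    page["next_fetch"] = now + LOCK_WINDOW

def on_updatedb(page, fetch_time, fetch_interval_days):
    """After the segment is fetched and the db updated, restore the
    real schedule: fetchTime + fetchInterval."""
    page["next_fetch"] = fetch_time + timedelta(days=fetch_interval_days)

now = datetime(2005, 6, 1)
page = {"url": "http://example.com/", "next_fetch": now}
on_generate(page, now)     # locked: now + 7 days
on_updatedb(page, now, 1)  # db.default.fetch.interval = 1 day
print(page["next_fetch"])  # one day after the fetch
```

So a page generated into a fetchlist looks "busy" for 7 days, but as soon as updatedb runs, its schedule is governed by fetchInterval again.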

2. I have modified the code to ignore the fetchInterval value coming from the fetchlist, meaning that fetchInterval stays equal to its initial value (the current time). After I run the following commands: fetch, updatedb, and generate, I get a new fetchlist, but this list doesn't include my original sites, even though their next fetch time should already be in the past. Can somebody help me understand when those URLs will be fetched?

It's difficult to say what changes you have made... I suggest sticking with the current code until it makes more sense to you... ;-)
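For reference, one round of the standard whole-web cycle with the unmodified code looks roughly like this (0.x-era tool names; the directory names are just examples and will depend on your setup):

```shell
bin/nutch generate db segments     # write a new fetchlist into segments/
s=`ls -d segments/2* | tail -1`    # pick the newest segment
bin/nutch fetch $s                 # fetch the pages on that fetchlist
bin/nutch updatedb db $s           # fold results back into the WebDB;
                                   # this is when the next fetch time is reset
```

Repeating this loop is what re-fetches pages once their fetchInterval elapses.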

3. It looks like the fetcher fails to extract links from http://www.eltweb.com. I know that some formats (apparently including some HTML variations) are not supported. Where can I find information on what is currently supported?

This site has a content redirect (using an HTML meta tag) on its home page, and no other content. Nutch 0.6 did not support this; you need to get the latest SVN version in order to crawl such sites.

4. Some of the outlinks discovered during the fetch (for instance http://www.webct.com/software/viewpage?name=software_campus_edition or http://v.extreme-dm.com/?login=cguilfor ) are being ignored (not included in the next fetchlist after executing the [generate db segments] command). Is there a known reason for this? Is there any documentation describing the supported URL types?

Outlinks discovered during a crawl are added to the WebDB (if and only if they pass urlfilter-regex), and then, if they point to pages that pass the urlfilter, they are included in the next fetchlist. This is normal.
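In this particular case, note that both example URLs contain a '?'. The default URL filter rules shipped with Nutch skip URLs containing certain characters (such as query strings), which would explain why these links never reach a fetchlist. An excerpt of the kind of rules involved (check the actual file in your conf/ directory, as the defaults vary by version):

```
# excerpt in the style of conf/regex-urlfilter.txt
# skip URLs containing certain characters, e.g. query strings
-[?*!@=]
# accept anything else
+.
```

Removing or relaxing the `-[?*!@=]` line allows URLs with query strings to be crawled, at the cost of possibly fetching many dynamically generated pages.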


I'm still new to this software; I've tried to explain what I did and hope it was clear enough, but I'm not sure I have asked the right questions.

Some of this information is explained in more detail on the Nutch Wiki.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
