Daniel D. wrote:
Hello,

I'm trying to understand how to start with an initial set of URLs and then continue fetching new URLs and re-fetching existing URLs (when they are due to be re-fetched).

I have run some tests in order to understand the software's behavior. Now I have some questions for you guys and would appreciate your help.

1. I have set db.default.fetch.interval to 1 (in nutch-default.xml), but I have noticed that the fetchInterval field in the Page object is being set to the current time + 7 days while the URL link data is being read from the fetchlist. Can somebody explain why, or am I not reading the code correctly?

Yes. This is required so that you can generate many fetchlists in rapid succession (for parallel crawling) without getting the same pages in several of them. It is roughly equivalent to setting a flag that says "this page is already on a fetchlist; wait 7 days before attempting to put it on another one".

After you have fetched a segment and updated the db, this time is reset to fetchTime + fetchInterval.
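To illustrate the two-step scheduling described above, here is a minimal Python sketch. This is not Nutch's actual code; the field names and the 7-day lock window are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Hypothetical lock window: keeps a page out of subsequent fetchlists
# generated before this one has been fetched and the db updated.
LOCK_WINDOW = timedelta(days=7)

def on_generate(page, now):
    """When a page is placed on a fetchlist, push its next-fetch time
    forward so that parallel 'generate' runs will not select it again."""
    page["next_fetch"] = now + LOCK_WINDOW

def on_updatedb(page, fetch_time, fetch_interval_days):
    """After the segment is fetched and the db updated, restore the
    real schedule: fetchTime + fetchInterval."""
    page["next_fetch"] = fetch_time + timedelta(days=fetch_interval_days)

now = datetime(2005, 6, 1)
page = {"url": "http://example.com/", "next_fetch": now}
on_generate(page, now)     # locked: now + 7 days
on_updatedb(page, now, 1)  # db.default.fetch.interval = 1 day
print(page["next_fetch"])  # one day after the fetch
```

So a page generated into a fetchlist looks "busy" for 7 days, but as soon as updatedb runs, its schedule is governed by fetchInterval again.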

2. I have modified the code to ignore the fetchInterval value coming from the fetchlist, meaning that fetchInterval stays equal to its initial value (the current time). After I run the following commands: fetch, updatedb, and generate, I get a new fetchlist, but this list doesn't include my original sites, even though their next fetch time should already be in the past. Can somebody help me understand when those URLs will be fetched?

It's difficult to say what changes you have made... I suggest sticking with the current code until it makes more sense to you... ;-)
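For reference, one round of the standard whole-web cycle with the unmodified code looks roughly like this (0.x-era tool names; the directory names are just examples and will depend on your setup):

```shell
bin/nutch generate db segments     # write a new fetchlist into segments/
s=`ls -d segments/2* | tail -1`    # pick the newest segment
bin/nutch fetch $s                 # fetch the pages on that fetchlist
bin/nutch updatedb db $s           # fold results back into the WebDB;
                                   # this is when the next fetch time is reset
```

Repeating this loop is what re-fetches pages once their fetchInterval elapses.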

3. It looks like the fetcher fails to extract links from http://www.eltweb.com. I know that some formats (apparently including some HTML variations) are not supported. Where can I find information on what is currently supported?

This site has a content redirect (using an HTML meta tag) on its home page, and no other content. Nutch 0.6 did not support this; you need to get the latest SVN version in order to crawl such sites.

4. Some of the outlinks discovered during the fetch (for instance http://www.webct.com/software/viewpage?name=software_campus_edition or http://v.extreme-dm.com/?login=cguilfor ) are being ignored (not included in the next fetchlist after executing the [generate db segments] command). Is there a known reason for this? Is there any documentation describing the supported URL types?

Outlinks discovered during a crawl are added to the WebDB (if and only if they pass urlfilter-regex), and then, if they point to pages that pass the urlfilter, they are included in the next fetchlist. This is normal.
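In this particular case, note that both example URLs contain a '?'. The default URL filter rules shipped with Nutch skip URLs containing certain characters (such as query strings), which would explain why these links never reach a fetchlist. An excerpt of the kind of rules involved (check the actual file in your conf/ directory, as the defaults vary by version):

```
# excerpt in the style of conf/regex-urlfilter.txt
# skip URLs containing certain characters, e.g. query strings
-[?*!@=]
# accept anything else
+.
```

Removing or relaxing the `-[?*!@=]` line allows URLs with query strings to be crawled, at the cost of possibly fetching many dynamically generated pages.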


I'm still new to this software; I've tried to explain what I did and hope it was clear enough, but I'm not sure I have asked the right questions.

Some of this information is explained in more detail on the Nutch Wiki.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
