Re: [Nutch-general] Following outlinks during - or after - seed fetch

Sean Dean Tue, 06 Mar 2007 22:13:36 -0800

The default re-fetch interval for fetched content is 30 days, so what's in your index now 
will not be fetched again until those days have passed.
 
All the new links are ready to be fetched immediately. Just create another 
segment from the same Nutch DB and it will include all of those new links to be 
fetched.
 
You might want to run some stats on your Nutch DB before you do this, or at 
least limit the size of the new segment being created. Depending on the size of 
your first segment and the number of links on those pages, you might have 
imported a lot more links than you're expecting.
 
Stats command: 
 
bin/nutch readdb crawl/crawldb -stats

Limiting segment size:

bin/nutch generate crawl/crawldb crawl/segments -topN [maximum number of links]
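
Putting it together, one more round over the new links could look roughly like 
the sketch below. It follows the generate / fetch / updatedb steps from the 
tutorial. The variable names (NUTCH, CRAWLDB, SEGMENTS, TOPN), the helper 
functions and the value 1000 are only placeholders of mine, not Nutch settings, 
and I'm assuming your fetcher parses pages while fetching. If parsing is turned 
off in your config, you would need a separate parse step before updatedb.

```shell
#!/bin/sh
# Sketch of one additional crawl round. All variable names and defaults
# here are illustrative assumptions; adjust them to your own layout.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"
TOPN="${TOPN:-1000}"

# Print the newest segment directory under $1. Segment names are
# timestamps, so the lexically last entry is the most recent one.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | tail -1
}

# Generate a size-limited segment, fetch it, and fold the results
# (including newly discovered links) back into the crawl DB.
next_round() {
    "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN" || return 1
    segment=$(latest_segment "$SEGMENTS")
    [ -n "$segment" ] || return 1
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" updatedb "$CRAWLDB" "$segment"
}

# Only run a round when a Nutch launcher is actually present.
if [ -x "$NUTCH" ]; then
    next_round
fi
```

Each further round picks up the outlinks found in the previous one, so you can 
repeat it until the stats show nothing left that you want to fetch.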

----- Original Message ----
From: Ricardo J. Méndez <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, March 7, 2007 12:16:54 AM
Subject: Following outlinks during - or after - seed fetch

Hi,

I've written a plugin and have been running some tests with Nutch, based
on the tutorials on the wiki (specifically
http://wiki.apache.org/nutch/NutchTutorial ).  I'm seeding the crawl
list with a limited item list, so that I can verify the items are being
loaded.

After the end of the fetch, the index is correctly populated with the
items I told it to fetch.   How can I start a crawl from the outlinks on
the items I've seeded?

Thanks in advance,

Ricardo J. Méndez
http://ricardo.strangevistas.net/


