Great!

Does anyone have an estimate, or real data, on the number of pages that can
be crawled from these roots?

Some well-known search blogs suggest that the Yahoo or Google indexes may
contain 20-60 billion documents. Do you know whether magnitudes like these
can be reached starting from the DMOZ roots you propose, or is it necessary
to add some other "secret recipe" roots (known only to a few) to get to
those numbers?

Best regards,

Carlos



On 1/10/07, Sean Dean <[EMAIL PROTECTED]> wrote:

Follow the whole-web crawling tutorial here:
http://lucene.apache.org/nutch/tutorial8.html#Whole-web+Crawling


It will seed your Nutch DB with many "core" Internet sites provided by
DMOZ. You can then immediately create a fetch list of roughly 4 million
URLs.

This would be the recommended starting point for any whole-web crawl.
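The tutorial has Nutch's DmozParser do the URL extraction and subsetting for you. As a rough illustration of what that step does (a simplified sketch, not the real parser -- the function name and regex here are my own, and real DMOZ dumps are more complex than this):

```python
import re
import random

def sample_dmoz_urls(rdf_text, subset=3, seed=0):
    """Pull page URLs out of a DMOZ content.rdf.u8 dump and keep
    roughly 1 in `subset` of them, in the spirit of DmozParser's
    -subset option. Hypothetical helper for illustration only."""
    # DMOZ entries look like: <ExternalPage about="http://example.com/">
    urls = re.findall(r'<ExternalPage about="([^"]+)"', rdf_text)
    rng = random.Random(seed)
    # Keep each URL with probability 1/subset
    return [u for u in urls if rng.randrange(subset) == 0]

sample = '''
<ExternalPage about="http://example.com/a">
<ExternalPage about="http://example.org/b">
<ExternalPage about="http://example.net/c">
'''
# subset=1 keeps everything; larger values thin the seed list
print(sample_dmoz_urls(sample, subset=1))
```

The resulting list would then be written to a seed directory and injected into the crawl DB with "bin/nutch inject", as the tutorial describes.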


----- Original Message ----
From: Carlos González-Cadenas <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, January 10, 2007 5:53:02 AM
Subject: fetch list


Hi all,

In order to perform a whole-web crawl, I suppose that choosing the roots of
the fetch (the fetch list) is critical.

Do you have any hints on how to do that? Or, alternatively, a "proposed"
fetch list for that purpose?

Thanks in advance,

Best regards,

Carlos

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general