By default, Nutch is designed to collect up to 100 links from each fetched page,
which are then added to the Nutch DB. You can then fetch those pages, and the
process repeats.
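
To make that loop concrete, here is a toy sketch in Python. It is not Nutch code: the names `crawl`, `fetch_outlinks`, and `toy_outlinks` are made up for illustration, and the "web" is synthetic. The cap of 100 mirrors the default mentioned above (I believe the corresponding setting is `db.max.outlinks.per.page` in nutch-default.xml, but check your own config).

```python
MAX_OUTLINKS_PER_PAGE = 100  # assumed default cap, as described above


def crawl(seeds, fetch_outlinks, rounds, max_outlinks=MAX_OUTLINKS_PER_PAGE):
    """Toy generate/fetch/update loop; returns the set of all known URLs."""
    db = set(seeds)          # every URL seen so far (stand-in for the Nutch DB)
    unfetched = list(seeds)  # URLs that are known but not yet fetched
    for _ in range(rounds):
        if not unfetched:
            break  # nothing left to fetch
        # "generate": this round's fetch list is everything still unfetched
        fetch_list, unfetched = unfetched, []
        for url in fetch_list:
            # "fetch" + "update": keep at most max_outlinks links per page
            for link in fetch_outlinks(url)[:max_outlinks]:
                if link not in db:
                    db.add(link)
                    unfetched.append(link)
    return db


def toy_outlinks(url):
    # Hypothetical synthetic web: every page links to 150 distinct children.
    return [f"{url}/{i}" for i in range(150)]


if __name__ == "__main__":
    known = crawl(["http://example.com"], toy_outlinks, rounds=2)
    print(len(known))  # 1 seed + 100 + 100*100 = 10101
```

Even with the per-page cap, the number of known URLs grows roughly a hundredfold per round in this toy setup, which is why storage tends to become the limit long before the URL supply does.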
 
I wouldn't call it "real data", but I would expect Nutch to let you compete
with Yahoo's or Google's index size if you have the money, hardware and time.
But then again, if you have those three you could probably do just about
anything.
 
Most of us run out of gigabytes before we run out of URLs to fetch.


----- Original Message ----
From: Carlos González-Cadenas <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, January 11, 2007 1:17:00 PM
Subject: Re: fetch list


Great!!!,

Does anyone have an estimate or real data on the number of pages that can be
crawled with these roots?

Some well-known search blogs suggest that the Yahoo or Google indexes may
hold 20-60 billion documents. Do you know whether that scale can be reached
starting from the DMOZ roots you propose, or whether it is necessary to add
some other "secret recipe roots" (known only to a few) to get to those
numbers?

Best regards,

Carlos



On 1/10/07, Sean Dean <[EMAIL PROTECTED]> wrote:
>
> Follow the whole-web crawling tutorial here:
> http://lucene.apache.org/nutch/tutorial8.html#Whole-web+Crawling
>
>
> It will seed your Nutch DB with many "core" Internet sites provided by
> DMOZ. You can then immediately create a fetch list of about 4 million
> URLs.
>
> This would be the recommended starting point for any whole-web crawl.
>
>
> ----- Original Message ----
> From: Carlos González-Cadenas <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, January 10, 2007 5:53:02 AM
> Subject: fetch list
>
>
> Hi all,
>
> To perform a full-web crawl, I suppose that choosing the roots of the
> fetch (the fetch list) is critical.
>
> Do you have any hints on how to do that? Or, alternatively, a "proposed"
> fetch list for that purpose?
>
> Thanks in advance,
>
> Best regards,
>
> Carlos
>
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
