Dear Sami Siren,

Thank you for your prompt answer, but my problem with 0.8.1 was not the 
fetching time itself, which is on par with 0.7.2 (although your fetching 
speed is a lot greater than mine).  My problem is the time taken by all the 
post-fetching processes, which is much longer than with 0.7.2.  When I 
indexed those one million pages with 0.7.2, the whole process took about a 
weekend;  when I tried to index 500,000 pages with 0.8.1, the fetching 
went OK, but after that I could not get the job done.  The weekend went by 
and I just could not wait any longer.  That's why I think that, in many 
cases where a single machine is used, 0.7.2 could be a better choice, 
especially if this version receives updates.

Regards

----- Original Message ----- 
From: "Sami Siren" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, November 13, 2006 4:28 PM
Subject: Re: Strategic Direction of Nutch


> carmmello wrote:
>> So, I think, one of the possibilities for the user of a single machine is 
>> that the Nutch developers could use some of their time to improve the 
>> previous 0.7.2, adding to it some new features, with further releases of 
>> this series.  I don't believe that there are many Nutch users, in the real 
>> world of searching, with a farm of computers.  I, for myself, have 
>> already built an index of more than one million pages on a single 
>> machine, with a somewhat old Athlon 2.4+ and 1 gig of memory, using the 
>> 0.7.2 version, with very good results, including the actual searching, 
>> and gave up the same task, using the 0.8 version, because of the large 
>> amount of time required, time that I did not have, to complete all the 
>> tasks, after the fetching of the pages.
>
> How fast do you need to go?
>
> I did a 1 million page crawl today with trunk version of nutch patched 
> with NUTCH-395 [1]. total time for fetching was little over 7 hrs.
>
> But of course there are still various ways to optimize fetching process - 
> for example optimizing the scheduling of urls to fetch, improving nutch 
> agent to use Accept header [2] for failing fast on content it cannot 
> handle etc.
>
> [1]http://issues.apache.org/jira/browse/NUTCH-395
> [2]http://www.mail-archive.com/[email protected]/msg04344.html
>
> --
>  Sami Siren
>
>

