Re: [Nutch-general] Strategic Direction of Nutch

Sami Siren Mon, 13 Nov 2006 10:28:59 -0800

carmmello wrote:
> So, I think, one of the possibilities for the user of a single machine 
> is that the Nutch developers could use some of their time to improve the 
> previous 0.7.2, adding some new features to it, with further releases of 
> this series.  I don't believe that there are many Nutch users, in the 
> real world of searching, with a farm of computers.  I myself have 
> already built an index of more than one million pages on a single 
> machine, with a somewhat old Athlon 2.4+ and 1 gig of memory, using the 
> 0.7.2 version, with very good results, including the actual searching.  
> I gave up the same task using the 0.8 version because of the large 
> amount of time required, time that I did not have, to complete all the 
> tasks after the fetching of the pages.


How fast do you need to go?

I did a 1 million page crawl today with the trunk version of Nutch patched 
with NUTCH-395 [1]. Total time for fetching was a little over 7 hours.

But of course there are still various ways to optimize the fetching 
process, for example optimizing the scheduling of URLs to fetch, or 
improving the Nutch agent to send an Accept header [2] so that it fails 
fast on content it cannot handle.
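
To illustrate the Accept header idea, here is a minimal sketch in plain 
Java. It is not Nutch code: the class name, the PARSEABLE list and the 
helper methods are all hypothetical, and it uses only the standard 
java.net.HttpURLConnection API. The point is that the fetcher states which 
media types it can parse, and then decides from the status code and the 
Content-Type header whether the body is worth downloading.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Locale;

public class AcceptHeaderSketch {
    // Hypothetical list of media types the crawler has parsers for.
    static final List<String> PARSEABLE =
            List.of("text/html", "text/plain", "application/xhtml+xml");

    // Join the media types into a single Accept header value.
    static String acceptHeader(List<String> types) {
        return String.join(", ", types);
    }

    // True if the Content-Type names a type we can parse.
    // Parameters such as "; charset=UTF-8" are ignored.
    static boolean isParseable(String contentType, List<String> types) {
        if (contentType == null) {
            return false;
        }
        String mediaType =
                contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return types.contains(mediaType);
    }

    // Send the Accept header and decide, from the status line and headers
    // only, whether the response body should be read at all.
    static boolean shouldDownload(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", acceptHeader(PARSEABLE));
        try {
            int status = conn.getResponseCode();
            if (status == HttpURLConnection.HTTP_NOT_ACCEPTABLE) {
                return false; // 406: server has nothing in an accepted type
            }
            return status == HttpURLConnection.HTTP_OK
                    && isParseable(conn.getContentType(), PARSEABLE);
        } finally {
            conn.disconnect();
        }
    }
}
```

Many servers ignore Accept and return the content anyway, which is why the 
sketch also checks Content-Type before reading the body.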

[1]http://issues.apache.org/jira/browse/NUTCH-395
[2]http://www.mail-archive.com/[email protected]/msg04344.html

--
  Sami Siren

