Dear Sami Siren,

Thank you for your prompt answer. My problem with 0.8.1 was not the fetching time itself, which is on par with 0.7.2 (although your fetching speed is a lot greater than mine). My problem is the time taken by all the post-fetching processes, which is much longer than with 0.7.2.

When I indexed that million pages, the whole process took about a weekend. When I tried to index 500,000 pages with 0.8.1, the fetching went OK, but after that I could not get the job done. The weekend went by and I just could not wait any longer. That's why I think that, in many cases, for someone using a single machine, 0.7.2 could be the better choice, especially if that version is kept updated.
Regards

----- Original Message -----
From: "Sami Siren" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, November 13, 2006 4:28 PM
Subject: Re: Strategic Direction of Nutch

> carmmello wrote:
>> So, I think, one of the possibilities for the user of a single machine is
>> that the Nutch developers could use some of their time to improve the
>> previous 0.7.2, adding to it some new features, with further releases of
>> this series. I don't believe that there are many Nutch users, in the real
>> world of searching, with a farm of computers. I, for myself, have
>> already built an index of more than one million pages on a single
>> machine, with a somewhat old Athlon 2.4+ and 1 gig of memory, using the
>> 0.7.2 version, with very good results, including the actual searching,
>> and gave up the same task, using the 0.8 version, because of the large
>> amount of time required, time that I did not have, to complete all the
>> tasks, after the fetching of the pages.
>
> How fast do you need to go?
>
> I did a 1 million page crawl today with the trunk version of Nutch patched
> with NUTCH-395 [1]. Total time for fetching was a little over 7 hrs.
>
> But of course there are still various ways to optimize the fetching process,
> for example optimizing the scheduling of URLs to fetch, improving the Nutch
> agent to use the Accept header [2] for failing fast on content it cannot
> handle, etc.
>
> [1] http://issues.apache.org/jira/browse/NUTCH-395
> [2] http://www.mail-archive.com/[email protected]/msg04344.html
>
> --
> Sami Siren

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
