Hi,
Nutch, from version 0.8 onward, is very slow at processing data after the
crawl when run on a single machine.  Compared with Nutch 0.7.2, I would say,
from my experience indexing about 500,000 pages, that it is roughly 4 to 5
times slower.  In addition, there is no way to repair broken segments (for
example, when the crawl is interrupted for some reason).
So I think one option for single-machine users is for the Nutch developers
to spend some of their time improving the earlier 0.7.2, adding new features
to it in further releases of that series.  I don't believe there are many
Nutch users, in the real world of searching, with a farm of computers.  I
myself have already built an index of more than one million pages on a
single machine, a somewhat old Athlon 2.4+ with 1 GB of memory, using
version 0.7.2, with very good results, including the actual searching.  I
gave up on the same task with version 0.8 because of the large amount of
time, which I did not have, required to complete all the steps after
fetching the pages.
Thanks,
Wilson Melo



----- Original Message ----- 
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, November 13, 2006 7:32 AM
Subject: Re: Strategic Direction of Nutch


> Nutch Newbie wrote:
>> Well, I would like to agree with Piotr here, but in current development,
>> i.e. version 0.8 and onwards, a single-machine Nutch install is not
>> optimal. There are various Hadoop-related issues, for example:
>>
>> http://issues.apache.org/jira/browse/HADOOP-206
>
> Is it really still a valid issue? I'm pretty sure this was already fixed, 
> or perhaps it was a matter of putting hard limits in hadoop-site.xml 
> (which overrides even job.xml values).
>
>
>> The problem of 0.8 being slow on a single machine is nothing new; just
>> search the mailing list and you will find many examples of it. 0.8 was
>> released earlier this year and the problem is still not solved, so I am
>> sorry to be negative, but I am just stating facts.
>
> What Nutch needs at this moment is more developers and contributors. This
> and similar issues might be solved by directly addressing each problem, if
> we had the human resources to do so. As it is, there are few active Nutch
> developers, and issues are being addressed more slowly than we would wish.
>
> (BTW, Chris Mattmann will be joining the committers group, so you can 
> expect some improvements in this regard).
>
> But what Piotr stated is that use cases such as yours _are_ important to 
> us, and this problem will be fixed sooner or later, whenever we have free 
> resources to do it. If you can help us with debugging and testing, and 
> providing patches, this process will be much quicker.
>
> I suspect that we (the Nutch community) are the only serious users of
> Hadoop in local mode - most development effort in the Hadoop project is
> geared towards supporting massive clusters, not single machines. So, I
> would say it's
> up to us - the Nutch community - to provide sufficient feedback to Hadoop 
> to have such issues addressed.
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

