On 10/26/06, AJ Chen <[EMAIL PROTECTED]> wrote:
> Current version of nutch uses mapred regardless of the number of computer
> nodes. So, for applications using a single computer and default
> configuration (i.e. no distributed crawling), the issue is not about
> performance gain from mapred, but rather how to minimize the overhead from
> mapred. Does anyone have a good performance benchmark for running nutch
> 0.8 or 0.9 on a single machine? Particularly, how much time spent in the
> map-reduce phases is reasonable relative to the time used by the fetching
> phase?
>
> Can anyone tell whether 4 hours of doing "reduce > reduce" after fetching
> 100,000 pages in 5 hours is within expectation? If it's not right, what
> might be the cause?
How much memory do you have? Also, what's the Java heap size? I have the
following configuration on my test server: 6 GB memory, one AMD64 CPU,
Java heap 4 GB, and I can do 250,000 pages, crawl to index, in about 2-3
hours. Also, I am fetching strictly HTML pages - no PDF, no Word, etc.
Sorry, it's very difficult to say what the problem could be.

Regards,
Zaheed

> AJ
>
> On 10/26/06, Josef Novak <[EMAIL PROTECTED]> wrote:
> >
> > Hi AJ,
> >
> > I very well may be wrong, but as I understand it, nutch/hadoop
> > implements map/reduce primarily as a means of efficiently and reliably
> > distributing work among nodes in a (large) cluster of consumer-grade
> > machines. I suspect that there is not much to be gained from
> > implementing it on a single machine.
> >
> > http://labs.google.com/papers/mapreduce.html
> > http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
> > http://wiki.apache.org/lucene-hadoop/
> >
> > happy hunting,
> > joe
> >
> > On 10/27/06, AJ Chen <[EMAIL PROTECTED]> wrote:
> > > I'm using 0.9-dev code to crawl the web on a single machine. Using
> > > default configuration, it spends ~5 hours to fetch 100,000 pages, but
> > > also >5 hours doing map-reduce. Is this the expected performance for
> > > the map-reduce phase relative to the fetch phase? It seems to me
> > > map-reduce takes too much time. Is there anything to configure in
> > > order to reduce the time spent in map-reduce? I'll appreciate any
> > > suggestion on how to improve web search performance on a single
> > > machine.
> > >
> > > Thanks,
> > >
> > > AJ
> > > http://web2express.org
>
> --
> AJ Chen, PhD
> http://web2express.org
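For reference, the single-machine setup Zaheed describes (HTML-only fetching, in-process map-reduce) maps roughly onto the following overrides. This is only a sketch assuming Nutch 0.8/0.9-era property names in conf/nutch-site.xml; the exact plugin ids in plugin.includes vary between releases, so check your own conf/nutch-default.xml for the authoritative list.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: single-machine crawl overrides (sketch; plugin
     ids are illustrative and should be checked against nutch-default.xml) -->
<configuration>
  <!-- Fetch/parse HTML and plain text only: leaving parse-pdf, parse-msword,
       etc. out of the list skips the expensive non-HTML parsers. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>
  <!-- Run map-reduce in-process. This is the default for a standalone
       install, but being explicit guards against accidentally pointing
       at a remote job tracker. -->
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>
```

The heap size Zaheed mentions is set outside this file; the Nutch launcher script reads an environment variable for it (NUTCH_HEAPSIZE, in MB, in the 0.8-era bin/nutch script), e.g. `export NUTCH_HEAPSIZE=4000` before starting the crawl.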
