On 10/26/06, AJ Chen <[EMAIL PROTECTED]> wrote:
> Current version of nutch uses mapred regardless of the number of computer
> nodes. So, for applications using a single computer and default
> configuration (i.e. no distributed crawling), the issue is not about
> performance gain from mapred, but rather how to minimize the overhead from
> mapred. Does anyone have a good performance benchmark for running nutch
> 0.8 or 0.9 on a single machine? Particularly, how much time spent in the
> map-reduce phases is reasonable relative to the time used by the fetching
> phase?
>
> Can anyone tell whether 4 hours of doing "reduce > reduce" after fetching
> 100,000 pages in 5 hours is within expectation? If it's not right, what
> might be the cause?
How much memory do you have? Also, what's the Java heap size? I have the
following configuration on my test server: 6 GB memory, one AMD64 CPU,
Java heap 4 GB, and I can do 250,000 pages, crawl to index, in about 2-3
hours. Also, I am fetching strictly HTML pages - no PDF, no Word, etc.
Sorry, it's very difficult to say what the problem could be.

Regards,
Zaheed

> AJ
>
> On 10/26/06, Josef Novak <[EMAIL PROTECTED]> wrote:
> >
> > Hi AJ,
> >
> > I very well may be wrong, but as I understand it, nutch/hadoop
> > implements map/reduce primarily as a means of efficiently and reliably
> > distributing work among nodes in a (large) cluster of consumer-grade
> > machines. I suspect that there is not much to be gained from
> > implementing it on a single machine.
> >
> > http://labs.google.com/papers/mapreduce.html
> > http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
> > http://wiki.apache.org/lucene-hadoop/
> >
> > happy hunting,
> > joe
> >
> > On 10/27/06, AJ Chen <[EMAIL PROTECTED]> wrote:
> > > I'm using 0.9-dev code to crawl the web on a single machine. Using
> > > default configuration, it spends ~5 hours to fetch 100,000 pages, but
> > > also >5 hours doing map-reduce. Is this the expected performance for
> > > the map-reduce phase relative to the fetch phase? It seems to me
> > > map-reduce takes too much time. Is there anything to configure in
> > > order to reduce the time spent in map-reduce? I'll appreciate any
> > > suggestion on how to improve web search performance on a single
> > > machine.
> > >
> > > Thanks,
> > >
> > > AJ
> > > http://web2express.org
>
> --
> AJ Chen, PhD
> http://web2express.org
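For reference, the single-machine setup Zaheed describes (HTML-only fetching, in-process map-reduce) maps roughly onto the following overrides. This is only a sketch assuming Nutch 0.8/0.9-era property names in conf/nutch-site.xml; the exact plugin ids in plugin.includes vary between releases, so check your own conf/nutch-default.xml for the authoritative list.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: single-machine crawl overrides (sketch; plugin
     ids are illustrative and should be checked against nutch-default.xml) -->
<configuration>
  <!-- Fetch/parse HTML and plain text only: leaving parse-pdf, parse-msword,
       etc. out of the list skips the expensive non-HTML parsers. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>
  <!-- Run map-reduce in-process. This is the default for a standalone
       install, but being explicit guards against accidentally pointing
       at a remote job tracker. -->
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>
```

The heap size Zaheed mentions is set outside this file; the Nutch launcher script reads an environment variable for it (NUTCH_HEAPSIZE, in MB, in the 0.8-era bin/nutch script), e.g. `export NUTCH_HEAPSIZE=4000` before starting the crawl.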
