http://www.mail-archive.com/[email protected]/msg02394.html

"
Teruhiko Kurosaka wrote:

    Can I use MapReduce to run Nutch on a multi CPU system?
      

Yes.


    I want to run the index job on two (or four) CPUs
    on a single system.  I'm not trying to distribute the job
    over multiple systems.

    If MapReduce is the way to go,
    do I just specify config parameters like these:
    mapred.tasktracker.tasks.maximum=2
    mapred.job.tracker=localhost:9001
    mapred.reduce.tasks=2 (or 1?)

    and
    bin/start-all.sh

    ?
      

That should work. You'd probably want to set the default number of map 
tasks to be a multiple of the number of CPUs, and the number of reduce 
tasks to be exactly the number of CPUs.
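
As a sketch, that advice might translate into a hadoop-site.xml fragment like the following. The property names are those used by the Hadoop bundled with Nutch 0.8; the concrete values assume a 4-CPU machine and are illustrative, not prescriptive:

```xml
<!-- Hypothetical hadoop-site.xml overrides for a single 4-CPU machine. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>  <!-- run the jobtracker locally -->
  </property>
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>4</value>               <!-- concurrent tasks per tasktracker -->
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>8</value>               <!-- a multiple of the CPU count -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>               <!-- exactly the CPU count -->
  </property>
</configuration>
```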

Don't use start-all.sh, but rather just:

bin/nutch-daemon.sh start tasktracker
bin/nutch-daemon.sh start jobtracker
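
To make the rule of thumb above concrete, here is a small shell sketch that derives the task counts from the machine's CPU count. It assumes GNU coreutils `nproc` is available; the variable names are illustrative only:

```shell
# Rule of thumb from the reply above:
# map tasks = a small multiple of the CPUs, reduce tasks = exactly the CPUs.
NUM_CPUS=$(nproc)            # number of online CPUs (GNU coreutils)
MAP_TASKS=$((NUM_CPUS * 2))  # e.g. 8 map tasks on a 4-CPU box
REDUCE_TASKS=$NUM_CPUS       # e.g. 4 reduce tasks on a 4-CPU box
echo "mapred.map.tasks=$MAP_TASKS mapred.reduce.tasks=$REDUCE_TASKS"
```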


    Must I use NDFS for MapReduce?
      

No.

Doug

"





Doug Cook wrote:
> Hi,
>
> I've recently switched to 0.8 from 0.7, and after some initial fits and
> starts, I'm past the "get it working at all" stage to the "get reasonable
> performance" stage.
>
> I've got a single machine with 4 CPUs and a lot of memory. URL fetching
> works great because it's (mostly) multithreaded. But as soon as I hit the
> reduce phase of fetch, it's dog slow. I'm down to running on one CPU, and
> the phase can take days, leaving me vulnerable to losing everything should a
> process fail.
>
> Wait! you say. That's just what Hadoop is for! I'm all ears. I'd love some
> help getting my configuration right. I've seen examples/tutorials of
> configurations for multiple machines; am I just "faking" multiple machines
> on my single node (will that work?) or is there a cleaner, simpler approach?
>
> Alternatively, I was all excited to get an easy improvement with
> -numFetchers and run 4 fetchers simultaneously to use all my CPUs. But it
> looks like -numFetchers has gone away, and though there was a 0.8 patch,
> at a quick glance it didn't seem to have made it into the mainline
> source. I don't see the value of trying to merge it in if there's a
> cleaner Hadoop-based approach.
>
> Many thanks for any help.
>
> Doug
>   


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
