Re: Best practices for jobs with large Map output

2011-04-15 Thread Shai Erera
bq. "If you can change your job to handle metadata backed by a store in HDFS" I have two Mappers, one that works with HDFS and one with GPFS. The GPFS one does exactly that -- it stores the indexes in GPFS (which all Mappers and Reducers see as a shared location) and outputs just the pointer to tha…
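
A minimal sketch of the pattern described here: the map task builds its small index, writes it to a shared filesystem location, and emits only a pointer record, so the shuffle carries paths rather than index data. The class name, the /shared/indexes path layout, and the use of plain text as the "index" are illustrative assumptions, not Shai's actual code.

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: builds a small per-task "index", writes it to a
    // shared filesystem location, and emits only the path as the map output.
    public class PointerEmittingMapper
        extends Mapper<LongWritable, Text, Text, Text> {

      private final StringBuilder index = new StringBuilder();

      @Override
      protected void map(LongWritable key, Text value, Context context) {
        // Stand-in for real index building; here we just accumulate the input.
        index.append(value.toString()).append('\n');
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        String taskId = context.getTaskAttemptID().getTaskID().toString();
        Path indexPath = new Path("/shared/indexes/" + taskId); // assumed layout
        FileSystem fs = indexPath.getFileSystem(context.getConfiguration());
        FSDataOutputStream out = fs.create(indexPath);
        try {
          out.write(index.toString().getBytes("UTF-8"));
        } finally {
          out.close();
        }
        // Only the pointer goes through the shuffle; the reducer opens and
        // merges the files these paths point to.
        context.write(new Text("index"), new Text(indexPath.toString()));
      }
    }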

Re: successive mappers

2011-04-15 Thread Injun Joe
"that some map instances may not require further processing. So if I try to do everything in a single mapper instance" should be read as "that some key value pairs may not require further processing. So if I try to do everything in a single mapper ". From: In

Re: successive mappers

2011-04-15 Thread Injun Joe
The problem with doing all of them in a single mapper is that some map instances may not require further processing. So if I try to do everything in a single mapper instance, I will have a lot of CPUs lying idle while others take the load. From: Robert Evans

Re: successive mappers

2011-04-15 Thread Robert Evans
Hi, take a look at the multiple output format classes; http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/lib/MultipleTextOutputFormat.html is a good example. You should be able to create a custom output format class that matches your needs. Although, if all you are doing…
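
A minimal sketch of the approach Robert points at, using the old org.apache.hadoop.mapred API: subclass MultipleTextOutputFormat and override its file-naming hook. Routing each record to a directory named after its key is an illustrative choice, not something prescribed in the thread.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Sketch: fan map output out into per-key files instead of a single part file.
    public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {

      @Override
      protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // "name" is the default part-XXXXX file name; prefixing it with the key
        // sends records for different keys to different files under the output dir.
        return key.toString() + "/" + name;
      }
    }

In the driver this would be set with JobConf.setOutputFormat(KeyBasedOutputFormat.class); the job's output directory then contains one subtree per distinct key.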

successive mappers

2011-04-15 Thread Injun Joe
Hi, I am coding a map-reduce program which involves several map-reduce steps. The work that my program does is only in the mapper, so I was thinking of having no reduce steps, just successive mappers. The logic can be written like this for the mappers at iterations 0 and 1: 1. Take input. 2. Map 0: D…
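
A driver sketch of one way to express this, assuming the plan is simply to chain map-only jobs: each step runs with zero reducers and the output directory of step i becomes the input of step i+1. The step mapper classes and the path layout are placeholders.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SuccessiveMappersDriver {

      // Placeholder first step; real filtering/transformation logic would go here.
      public static class MapStep0 extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          context.write(new Text("step0"), value);
        }
      }

      // Placeholder second step, same shape as the first.
      public static class MapStep1 extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          context.write(new Text("step1"), value);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Class<?>[] steps = { MapStep0.class, MapStep1.class };

        Path input = new Path(args[0]);
        for (int i = 0; i < steps.length; i++) {
          Path output = new Path(args[1] + "/step-" + i);
          Job job = new Job(conf, "map-step-" + i);
          job.setJarByClass(SuccessiveMappersDriver.class);
          job.setMapperClass((Class<? extends Mapper>) steps[i]);
          job.setNumReduceTasks(0);           // map-only: no shuffle, no sort
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(Text.class);
          FileInputFormat.addInputPath(job, input);
          FileOutputFormat.setOutputPath(job, output);
          if (!job.waitForCompletion(true)) {
            System.exit(1);
          }
          input = output;                     // this step's output feeds the next
        }
      }
    }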

Re: Best practices for jobs with large Map output

2011-04-15 Thread Chris Douglas
On Fri, Apr 15, 2011 at 9:34 AM, Harsh J wrote: >> Is there absolutely no way to bypass the shuffle + sort phases? I don't mind writing some classes if that's what it takes ... > Shuffle is an essential part of the Map to Reduce transition; it can't be 'bypassed', since a Reducer has to fetc…
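
To make the point in the quoted exchange concrete: shuffle and sort only happen when a job has reduce tasks, so the only supported way to skip them is a map-only job. When a reduce step is genuinely needed, compressing intermediate map output is a common way to shrink the shuffle; that mitigation is general knowledge, not something stated in the truncated messages above. A sketch using 0.20-era configuration keys:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class LargeMapOutputJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // If the reduce step is required, the shuffle cannot be bypassed, but the
        // intermediate map output can be compressed so less data crosses the wire.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec",
            GzipCodec.class, CompressionCodec.class);

        Job job = new Job(conf, "large-map-output");
        // The only way to avoid shuffle/sort entirely is to run with no reducers:
        // job.setNumReduceTasks(0);
      }
    }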

Re: reduce copy rate

2011-04-15 Thread Harsh J
Hello Juwei, On Fri, Apr 15, 2011 at 10:43 PM, Juwei Shi wrote: > Harsh, do you know why reducers start one by one at intervals of several seconds? They do not start at the same time. For example, if we set the reduce task capacity (max concurrent reduce tasks) to be 100, and the average…

Re: reduce copy rate

2011-04-15 Thread Juwei Shi
Harsh, do you know why reducers start one by one at intervals of several seconds? They do not start at the same time. For example, suppose we set the reduce task capacity (max concurrent reduce tasks) to 100, and the average run time of a reduce task is 15 seconds. Although all map tasks are complet…
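
Background that may help with the question (general knowledge, not necessarily what the truncated reply above says): reduce tasks only become eligible for scheduling after a configurable fraction of map tasks have finished, and even then the JobTracker hands them out to tasktrackers one heartbeat at a time, so a large wave of reducers ramps up over several seconds rather than launching all at once. The 0.20-era "slowstart" knob looks like this:

    import org.apache.hadoop.mapred.JobConf;

    public class SlowstartExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Fraction of map tasks that must finish before the scheduler starts
        // launching this job's reduce tasks (0.20-era property name).
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);
      }
    }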

Re: reduce copy rate

2011-04-15 Thread Harsh J
Hello Baran, On Fri, Apr 15, 2011 at 8:19 PM, baran cakici wrote: > Hi, I have a question about the copy speed of a MapReduce job. I have a cluster with 4 slaves and 1 master, connected to each other through one 8-port switch (up to 1000 Mbps). The copy speed of my job is 1.6-1.8 MB. Is it no…

Re: Best practices for jobs with large Map output

2011-04-15 Thread Harsh J
Hello Shai, On Fri, Apr 15, 2011 at 5:45 PM, Shai Erera wrote: > The job is an indexing job. Each Mapper emits a small index and the Reducer merges all of those indexes together. The Mappers output the index as a Writable which serializes it. I guess I could write the Reducer's function as…

Re: reduce copy rate

2011-04-15 Thread Juwei Shi
The actual copy speed depends on how fast map output is produced. I guess the 1.8 MB figure reflects the rate at which your map tasks complete. 2011/4/15 baran cakici > Hi, I have a question about the copy speed of a MapReduce job. I have a cluster with 4 slaves and 1 master, connected to each other through one 8-Port-S…

reduce copy rate

2011-04-15 Thread baran cakici
Hi, I have a question about the copy speed of a MapReduce job. I have a cluster with 4 slaves and 1 master, connected to each other through one 8-port switch (up to 1000 Mbps). The copy speed of my job is 1.6-1.8 MB. Isn't that too slow? Regards, Baran
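
For scale, a rough comparison of the reported rate against the link speed mentioned in the question (assuming the 1.6-1.8 MB figure is per second):

    % Raw capacity of a 1000 Mbps link versus the observed copy rate.
    \[
      \frac{1000\ \text{Mbps}}{8\ \text{bits per byte}} = 125\ \text{MB/s},
      \qquad
      \frac{1.8\ \text{MB/s}}{125\ \text{MB/s}} \approx 1.4\%
    \]

So the copy phase is using only a percent or two of what the switch could carry, which fits Juwei's point that the copy rate is usually limited by how quickly map output becomes available rather than by the network.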

Re: Small linux distros to run hadoop ?

2011-04-15 Thread Niels Basjes
Hi, 2011/4/15 web service: > What is the smallest Linux system/distro to run Hadoop? I would want to run small Linux VMs and run jobs on them. I usually use a fully stripped CentOS 5 to run cluster nodes. It works perfectly and can be fully automated using the kickstart scripting for anacond…

Re: Best practices for jobs with large Map output

2011-04-15 Thread Shai Erera
Thanks for the prompt response, Harsh! The job is an indexing job. Each Mapper emits a small index and the Reducer merges all of those indexes together. The Mappers output the index as a Writable which serializes it. I guess I could write the Reducer's function as a separate class as you suggest,…
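
A minimal sketch of the merge step described here: every mapper's small index arrives at the reducer as a serialized Writable, and the reducer folds them into one. The IndexWritable type and its merge() method are hypothetical stand-ins for whatever index representation the real job uses.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical Writable holding a small per-mapper "index" plus a merge step.
    class IndexWritable implements Writable {
      private final List<String> entries = new ArrayList<String>();

      public void merge(IndexWritable other) {
        // Copies the entries out, so it is safe even though the framework
        // reuses the same value instance while iterating in the reducer.
        entries.addAll(other.entries);
      }

      public void write(DataOutput out) throws IOException {
        out.writeInt(entries.size());
        for (String e : entries) { out.writeUTF(e); }
      }

      public void readFields(DataInput in) throws IOException {
        entries.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) { entries.add(in.readUTF()); }
      }
    }

    // Sketch of the merge reducer: all per-mapper indexes arrive under one key
    // and are folded into a single output index.
    class IndexMergeReducer extends Reducer<Text, IndexWritable, Text, IndexWritable> {
      @Override
      protected void reduce(Text key, Iterable<IndexWritable> values, Context context)
          throws IOException, InterruptedException {
        IndexWritable merged = new IndexWritable();
        for (IndexWritable part : values) {
          merged.merge(part);
        }
        context.write(key, merged);
      }
    }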