You can't just set the block size; you need to modify the InputFormat to
change the number of splits. For example, you can do:

    FileInputFormat.setMaxInputSplitSize(job, maxSizeInBytes);

and you'll force it to make more splits in your data set, and hence more
mappers.

-jake
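A minimal driver sketch of the call Jake describes, assuming the
org.apache.hadoop.mapreduce API; the class name, paths, and the 2 MB cap are
illustrative rather than taken from the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplitSizeDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "split-size-demo");
        job.setJarByClass(SplitSizeDemo.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Cap each split at 2 MB (illustrative). A ~46 MB input then yields
        // roughly 23 splits, and hence roughly 23 map tasks, instead of the
        // default of one split per HDFS block.
        FileInputFormat.setMaxInputSplitSize(job, 2L * 1024 * 1024);

        // ... set mapper, reducer, and key/value classes as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }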
On Tue, Sep 6, 2011 at 4:12 PM, Dhruv Kumar <dku...@ecs.umass.edu> wrote:
> On Tue, Sep 6, 2011 at 6:57 PM, Chris Lu <c...@atypon.com> wrote:
> > Thanks. Very helpful to me!
> >
> > I tried to change the setting of "mapred.map.tasks". However, the number
> > of map tasks is still just one on one of the 20 machines.
> >
> > ./elastic-mapreduce --create --alive \
> >   --num-instances 20 --name "LDA" \
> >   --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
> >   --bootstrap-name "Configuring number of map tasks per job" \
> >   --args "-m,mapred.map.tasks=40"
> >
> > Does anyone know how to configure the number of mappers?
> > Again, the input size is only 46M.
> >
> > Chris
> >
> > On 09/06/2011 12:09 PM, Ted Dunning wrote:
> >> Well, I think that using small instances is a disaster in general. The
> >> performance that you get from them can easily vary by an order of
> >> magnitude. My own preference for real work is either m2xl or cc14xl.
> >> The latter machines give you nearly bare-metal performance and no noisy
> >> neighbors. The m2xl is typically very much underpriced on the spot
> >> market.
> >>
> >> Sean is right about your job being misconfigured. The Hadoop overhead
> >> is considerable, and you have only given it two threads to overcome
> >> that overhead.
> >>
> >> On Tue, Sep 6, 2011 at 6:12 PM, Sean Owen <sro...@gmail.com> wrote:
> >>
> >>> That's your biggest issue, certainly. Only 2 mappers are running, even
> >>> though you have 20 machines available. Hadoop determines the number of
> >>> mappers based on input size, and your input isn't so big that it
> >>> thinks you need 20 workers. It's launching 33 reducers, so your
> >>> cluster is put to use there. But it's no wonder you're not seeing
> >>> anything like a 20x speedup in the mapper.
> >>>
> >>> You can of course force it to use more mappers, and that's probably a
> >>> good idea here. -Dmapred.map.tasks=20 perhaps. More mappers means more
> >>> overhead of spinning up mappers to process less data, and Hadoop's
> >>> guess indicates that it thinks it's not efficient to use 20 workers.
> >>> If you know that those other 18 are otherwise idle, my guess is you'd
> >>> benefit from just making it use 20.
>
> Sean,
>
> I too have always been confused about how Hadoop decides on the number of
> mappers, so perhaps you could help my understanding here...
>
> Is -Dmapred.map.tasks just a hint to the framework for the number of
> mappers (just like using the combiner is a hint), or does it actually set
> the number of workers to that number (provided our input is large enough)?
>
> The reason I ask is that on
> http://wiki.apache.org/hadoop/HowManyMapsAndReduces it is mentioned that
> the framework uses the HDFS block size to decide on the number of mapper
> workers to be invoked. Should we be setting that parameter instead?
>
> >>> If this were a general large cluster where many people are taking
> >>> advantage of the workers, then I'd trust Hadoop's guesses until you
> >>> are sure you want to do otherwise.
> >>>
> >>> On Tue, Sep 6, 2011 at 7:02 PM, Chris Lu <c...@atypon.com> wrote:
> >>>
> >>>> Thanks for all the suggestions!
> >>>>
> >>>> All the inputs are the same. It takes 85 hours for 4 iterations on 20
> >>>> Amazon small machines.
> >>>> On my local single node, it got to iteration 19 in the same 85 hours.
> >>>>
> >>>> Here is a section of the Amazon log output. It covers the start of
> >>>> iteration 1, and between iteration 4 and iteration 5.
> >>>>
> >>>> The number of map tasks is set to 2. Should it be larger, or related
> >>>> to the number of CPU cores?
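On Dhruv's question: per the wiki page he cites, mapred.map.tasks is only a
hint to the InputFormat; the actual number of map tasks comes from how the
InputFormat splits the input (input size, HDFS block size, and any min/max
split size). Also, a command-line override such as Sean's
-Dmapred.map.tasks=20 only reaches the job configuration if the driver
parses generic options, for example via ToolRunner. A sketch of such a
driver, with placeholder names (MyDriver and "lda-example" are not from the
thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyDriver extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        // getConf() already contains any -D overrides that ToolRunner parsed,
        // e.g. -Dmapred.map.tasks=20 (a hint) or a smaller max split size.
        Job job = new Job(getConf(), "lda-example");
        job.setJarByClass(MyDriver.class);
        // ... input/output paths, mapper, reducer, key/value classes ...
        return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
      }
    }

Run as, for example: hadoop jar my-job.jar MyDriver -Dmapred.map.tasks=20
<input> <output>.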