Hi Harsh,

I just implemented a CombineFileInputFormat and its record reader for my case.
Now my input has 10 files of 233 MB each, and with this my job runs just 1 mapper that processes all of them. How can I control this through the split size? For example, if I ask for 1 GB splits, I would expect 3 mappers for these 10 files rather than 1.

Thanks,
-JJ

On Wed, May 25, 2011 at 10:05 AM, Harsh J <ha...@cloudera.com> wrote:
> This is the correct behavior. Regular FileInputFormat derivatives
> would transform, at the least, one file == one mapper. You need to
> look at CombineFileInputFormat/etc. to have multiple files per map
> task.
>
> On Wed, May 25, 2011 at 10:28 PM, Mapred Learn <mapred.le...@gmail.com> wrote:
> > I gave mapred.min.size=1000000000L, i.e. 1 GB; each input file is 233 MB
> > and block size = 64 MB.
> > With all these values, I thought my split size would work and 4 input files
> > would be combined to get a 1 GB input split, but somehow this does not happen
> > and I get 10 mappers, each corresponding to one 233 MB file.
> >
> > On Wed, May 25, 2011 at 7:59 AM, Mapred Learn <mapred.le...@gmail.com> wrote:
> >>
> >> Thanks Juwei!
> >> I will go through this.
> >>
> >> Sent from my iPhone
> >> On May 25, 2011, at 7:51 AM, Juwei Shi <shiju...@gmail.com> wrote:
> >>
> >> The following are suitable for hadoop 0.20.2.
> >>
> >> 2011/5/25 Juwei Shi <shiju...@gmail.com>
> >>>
> >>> The input split size is determined by mapred.min.split.size, dfs.block.size
> >>> and mapred.map.tasks.
> >>>
> >>> goalSize  = totalSize / mapred.map.tasks
> >>> minSize   = max(mapred.min.split.size, minSplitSize)
> >>> splitSize = max(minSize, min(goalSize, dfs.block.size))
> >>>
> >>> minSplitSize is determined by each InputFormat, such as
> >>> SequenceFileInputFormat.
> >>>
> >>> You may want to refer to FileInputFormat.java for more details.
> >>>
> >>> 2011/5/25 Mapred Learn <mapred.le...@gmail.com>
> >>>>
> >>>> Resending ====>
> >>>>
> >>>> > Hi,
> >>>> > I have a few input splits that are a few MB in size.
> >>>> > I want to submit 1 GB of input to every mapper. Does anyone know how
> >>>> > I can do it?
> >>>> > Currently each mapper gets one input split, which results in many small
> >>>> > map-output files.
> >>>> >
> >>>> > I tried setting -Dmapred.map.min.split.size=<number>, but it still
> >>>> > does not take effect.
> >>>> >
> >>>> > Thanks,
> >>>> > -JJ
> >>>
> >>> --
> >>> - Juwei Shi
> >>
> >> --
> >> - Juwei Shi (史巨伟)
> >
>
> --
> Harsh J
>
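
For reference, the split-size rule Juwei quotes above can be sketched in plain Java with the numbers from this thread. This is only an illustration, not the Hadoop source: the class and method names are made up, mapred.map.tasks = 2 is an assumed value (the thread does not state it), and the split count is simplified to "no split spans two files", which is the behaviour Harsh describes for regular FileInputFormat derivatives.

```java
// Sketch of the split-size rule quoted in this thread (Hadoop 0.20.2 era).
// No Hadoop dependency; SplitSizeSketch and its methods are hypothetical names.
final class SplitSizeSketch {

    // splitSize = max(minSize, min(goalSize, blockSize)), following the quoted formula.
    static long computeSplitSize(long totalSize, int numMapTasks,
                                 long configuredMinSplitSize, long formatMinSplitSize,
                                 long blockSize) {
        long goalSize = totalSize / Math.max(numMapTasks, 1);
        long minSize = Math.max(configuredMinSplitSize, formatMinSplitSize);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Simplified count: each file is cut on its own, so every file yields at
    // least one split however large splitSize is. Assumes non-empty files.
    static long countSplits(long[] fileSizes, long splitSize) {
        long splits = 0;
        for (long size : fileSizes) {
            long perFile = (size + splitSize - 1) / splitSize; // ceiling division
            splits += Math.max(perFile, 1);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] files = new long[10];
        java.util.Arrays.fill(files, 233 * mb);   // ten 233 MB input files
        long total = 10 * 233 * mb;

        // Minimum split size of 1000000000 bytes, 64 MB blocks,
        // assumed mapred.map.tasks = 2.
        long splitSize = computeSplitSize(total, 2, 1000000000L, 1, 64 * mb);
        System.out.println("splitSize = " + splitSize);
        System.out.println("splits    = " + countSplits(files, splitSize));
    }
}
```

Under these assumptions the computed split size is the full 1000000000 bytes, yet the count is still 10, because each 233 MB file is smaller than one split and files are never merged. That is consistent with the 10 mappers reported earlier in the thread and with the advice to use CombineFileInputFormat for multiple files per map task.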