It worked for me. Thanks a lot, Bejoy.

Thanks,
Thamizh
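For context on the fix that worked: Bejoy's recommendation in the thread below amounts to running a mapper-only job with TextInputFormat and zero reducers. Here is a minimal sketch of such a driver, assuming the old org.apache.hadoop.mapred API that shipped with 0.19.2; the MyMapper body and the Text value type are hypothetical stand-ins, since the thread's actual mapper and CustomWritable class are not shown.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class MapOnlyJob {

  // Placeholder mapper: passes each input line through unchanged.
  public static class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(value, new Text()); // real processing would go here
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("map-only-example");

    // One map task per HDFS block (at least one per file), instead of
    // MultiFileInputFormat's few large multi-file splits.
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    conf.setMapperClass(MyMapper.class);
    conf.setNumReduceTasks(0); // mapper-only: map output goes straight to HDFS

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

With setNumReduceTasks(0), each map task writes its output directly to HDFS (one output file per task), and TextInputFormat creates roughly one map task per HDFS block of input.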
On Fri, Feb 17, 2012 at 3:08 PM, Bejoy Ks <bejoy.had...@gmail.com> wrote:
> Hi Thamizh
>
> MultiFileInputFormat / CombineFileInputFormat is typically used where the
> input files are relatively small (typically less than a block size). When
> you use these, there is some loss of data locality, as the splits a mapper
> processes won't all be on the same node.
>
> TextInputFormat spawns one mapper per block by default (not one per file),
> so you retain much better data locality than with MultiFileInputFormat.
>
> If your mapper is not very short-lived and involves a decent amount of
> processing, then you can go with TextInputFormat. The one consideration
> you need to make is that, with your input, a running job may spawn a large
> number of map tasks, thereby occupying almost all the map task slots in
> your cluster. If there are other jobs to be triggered, they may have to
> wait for free map slots. You may need to consider using a scheduler that
> gives a fair share of slots to other parallel jobs, if any.
>
> Regards
> Bejoy.K.S
>
> On Fri, Feb 17, 2012 at 10:26 AM, Thamizhannal Paramasivam <
> thamizhanna...@gmail.com> wrote:
>
>> Thank you so much, Joey & Bejoy, for your suggestions.
>>
>> The job's input path has 1300-1400 text files, each of 100-200 MB.
>>
>> I thought TextInputFormat spawns a single mapper per file, and
>> MultiFileInputFormat spawns fewer mappers (fewer than the 1300-1400
>> files), each processing many input files.
>>
>> Which input format do you think would be most appropriate in my case,
>> and why?
>>
>> Looking forward to your reply.
>>
>> Thanks,
>> Thamizh
>>
>> On Thu, Feb 16, 2012 at 10:06 PM, Joey Echeverria <j...@cloudera.com> wrote:
>>
>>> Is your data size 100-200MB *total*?
>>>
>>> If so, then this is the expected behavior for MultiFileInputFormat. As
>>> Bejoy says, you can switch to TextInputFormat to get one mapper per
>>> block (minimum one mapper per file).
>>>
>>> -Joey
>>>
>>> On Thu, Feb 16, 2012 at 11:03 AM, Thamizhannal Paramasivam <
>>> thamizhanna...@gmail.com> wrote:
>>>
>>>> Here are the input settings for the mapper:
>>>> Input Format: MultiFileInputFormat
>>>> MapperOutputKey: Text
>>>> MapperOutputValue: CustomWritable
>>>>
>>>> I am not in a position to upgrade hadoop-0.19.2, for some reasons.
>>>>
>>>> I checked the number of mappers on the JobTracker.
>>>>
>>>> Thanks,
>>>> Thamizh
>>>>
>>>> On Thu, Feb 16, 2012 at 6:56 PM, Joey Echeverria <j...@cloudera.com> wrote:
>>>>
>>>>> Hi Tamil,
>>>>>
>>>>> I'd recommend upgrading to a newer release, as 0.19.2 is very old. As
>>>>> for your question, most input formats should set the number of
>>>>> mappers correctly. What input format are you using? Where did you see
>>>>> the number of tasks assigned to the job?
>>>>>
>>>>> -Joey
>>>>>
>>>>> On Thu, Feb 16, 2012 at 1:40 AM, Thamizhannal Paramasivam <
>>>>> thamizhanna...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> I am using hadoop-0.19.2 and running a mapper-only job on a cluster.
>>>>>> Its input path has >1000 files of 100-200 MB each. Since it is a
>>>>>> mapper-only job, I set the number of reducers to 0, and it is using
>>>>>> only 2 mappers to run over all the input files. If we do not state
>>>>>> the number of mappers, wouldn't it pick one mapper per input file?
>>>>>> Or wouldn't the default pick a fair number of mappers according to
>>>>>> the number of input files?
>>>>>> Thanks,
>>>>>> tamil
>>>>>
>>>>> --
>>>>> Joseph Echeverria
>>>>> Cloudera, Inc.
>>>>> 443.305.9434
>>>
>>> --
>>> Joseph Echeverria
>>> Cloudera, Inc.
>>> 443.305.9434
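For the opposite situation Bejoy describes, an input made of many files well under a block size, newer Hadoop releases provide CombineTextInputFormat, which packs several files into each split while bounding the split size (it is not available on 0.19.2). A rough sketch, assuming the Hadoop 2.x org.apache.hadoop.mapreduce API; the job name, the identity Mapper, and the 128 MB cap are illustrative choices, not anything from the thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "combine-small-files");
    job.setJarByClass(CombineSmallFilesJob.class);

    // Pack many small files into each split instead of one split per file.
    job.setInputFormatClass(CombineTextInputFormat.class);

    // Cap each combined split at ~128 MB so the framework still creates a
    // reasonable number of mappers across the cluster.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    job.setMapperClass(Mapper.class); // identity mapper as a placeholder
    job.setNumReduceTasks(0);         // mapper-only job

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Bounding the split size is what avoids the kind of collapse reported at the start of the thread, where more than 1000 input files ended up on only 2 map tasks.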