Ok. Got it. Now I have a single file of 129 MB, so it will be stored as two HDFS blocks. Since it is a JSON file, I cannot use TextInputFormat, because then every record handed to map() would be a single line of the JSON file, which I don't want. In this case, can I write a custom InputFormat and a custom RecordReader so that each logical record handed to map() contains only the part of the data I require?

For example, the data in the file looks like this:

{ "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString", "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
,
{ "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }

*I want every input record here to consist of one entire "Feature" object, so that I can process it accordingly by giving the relevant K,V pairs to the map function.*
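Something along these lines is what I have in mind (just a rough, untested sketch, and the class names are made up by me). It assumes the file is simply a sequence of "Feature" objects like the example above, with no surrounding FeatureCollection header and no curly braces inside string values. I realise that what I really need is one Feature per record (per call to map()) rather than per split, so the sketch keeps each file in a single split and lets the RecordReader hand out one Feature at a time:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hands one complete "Feature" object to each map() call, however many lines it spans.
public class JsonFeatureInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // one split per file, so a Feature is never cut in half at a split boundary
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                               TaskAttemptContext context) {
        return new JsonFeatureRecordReader();
    }

    public static class JsonFeatureRecordReader extends RecordReader<LongWritable, Text> {

        private final LineRecordReader lines = new LineRecordReader();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            lines.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            StringBuilder feature = new StringBuilder();
            int depth = 0;           // number of '{' not yet matched by '}'
            boolean started = false; // true once the first '{' of the next Feature is seen

            while (lines.nextKeyValue()) {
                String line = lines.getCurrentValue().toString();
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    if (c == '{') { depth++; started = true; }
                    else if (c == '}') { depth--; }
                }
                if (!started) {
                    continue; // skip the lone "," between Features and anything before the first '{'
                }
                feature.append(line).append('\n');
                if (depth == 0) { // braces balanced again: one full Feature collected
                    key.set(lines.getCurrentKey().get()); // byte offset of the Feature's last line
                    value.set(feature.toString());
                    return true;
                }
            }
            return false; // end of file, no further complete Feature
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() throws IOException { return lines.getProgress(); }
        @Override public void close() throws IOException { lines.close(); }
    }
}

In the driver I would then just call job.setInputFormatClass(JsonFeatureInputFormat.class) and keep the mapper's input types as <LongWritable, Text>. Does this look like the right direction, or is there a better way?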
--
Thanks & Regards,
Sugandha Naolekar


On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <donta...@gmail.com> wrote:

> Hi Sugandha,
>
> Please find my comments embedded below :
>
> No. of mappers are decided as: Total_File_Size/Max. Block Size. Thus, if the file is smaller than the block size, only one mapper will be invoked. Right?
>
> This is true (but not always). The basic criterion behind map creation is the logic inside the *getSplits* method of the *InputFormat* being used in your MR job. It is the behavior of *file based InputFormats*, typically sub-classes of *FileInputFormat*, to split the input data into splits based on the total size, in bytes, of the input files. See *this* <http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html> for more details. And yes, if the file is smaller than the block size then only 1 mapper will be created.
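Just to check that I am reading the FileInputFormat source correctly (please correct me if not), the split size it ends up using seems to boil down to this:

long blockSize = 128L * 1024 * 1024; // dfs.blocksize default in 2.x
long minSize   = 1L;                 // effective default of mapreduce.input.fileinputformat.split.minsize
long maxSize   = Long.MAX_VALUE;     // default of mapreduce.input.fileinputformat.split.maxsize
long splitSize = Math.max(minSize, Math.min(maxSize, blockSize)); // computeSplitSize() => 128 MB

and getSplits() only carves off a full split while the remaining bytes are more than SPLIT_SLOP (1.1) times the split size. So my 129 MB file, although stored as 2 HDFS blocks, is only about 1.01 x 128 MB and would come back as a single split, i.e. a single map task, if I have understood that correctly.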
> If yes, it means, the map() will be called only once. Right? In this case, if there are two datanodes with a replication factor as 1: only one datanode (mapper machine) will perform the task. Right?
>
> A mapper is called for each split. Don't get confused with the MR's split and HDFS's block. Both are different (they may overlap though, as in the case of FileInputFormat). HDFS blocks are a physical partitioning of your data, while an InputSplit is just a logical partitioning. If you have a file which is smaller than the HDFS blocksize then only one split will be created, hence only 1 mapper will be called. And this will happen on the node where this file resides.
>
> The map() function is called by all the datanodes/slaves right? If the no. of mappers are more than the no. of slaves, what happens?
>
> map() doesn't get called by anybody. It rather gets created on the node where the chunk of data to be processed resides. A slave node can run multiple mappers based on the availability of CPU slots.
>
> One more thing to ask: No. of blocks = no. of mappers. Thus, those many no. of times the map() function will be called right?
>
> No. of blocks = no. of splits = no. of mappers. A map is called only once per split per node where that split is present.
>
> HTH
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <sugandha....@gmail.com> wrote:
>
>> Hi Bertrand,
>>
>> As you said, no. of HDFS blocks = no. of input splits. But this is only true when you set isSplittable() as false or when your input file size is less than the block size. Also, when it comes to text files, the default TextInputFormat considers each line as one input split which can then be read by the RecordReader in K,V format.
>>
>> Please correct me if I don't make sense.
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>
>>> The wiki (or Hadoop: The Definitive Guide) are good resources.
>>>
>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>
>>> Mapper is the name of the abstract class/interface. It does not really make sense to talk about the number of mappers. A task is a JVM that can be launched only if there is a free slot, i.e. for a given slot, at a given time, there will be at most a single task. During the task, the configured Mapper will be instantiated.
>>>
>>> Always:
>>> Number of input splits = no. of map tasks
>>>
>>> And generally:
>>> number of HDFS blocks = number of input splits
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>> PS : I don't know if it is only my client, but avoid red when writing a mail.
>>>
>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <drdwi...@gmail.com> wrote:
>>>
>>>> Each node has a tasktracker with a number of map slots. A map slot hosts a mapper. A mapper executes map tasks. If there are more map tasks than slots, obviously there will be multiple rounds of mapping.
>>>>
>>>> The map function is called once for each input record. A block is typically 64 MB and can contain a multitude of records, therefore a map task = run the map() function on all records in the block.
>>>>
>>>> Number of blocks = no. of map tasks (not mappers)
>>>>
>>>> Furthermore you have to make a distinction between the two layers. You have a layer for computations which consists of a jobtracker and a set of tasktrackers. The other layer is responsible for storage. The HDFS has a namenode and a set of datanodes.
>>>>
>>>> In mapreduce the code is executed where the data is. So if a block is in datanodes 1, 2 and 3, then the map task associated with this block will likely be executed on one of those physical nodes, by tasktracker 1, 2 or 3. But this is not necessary; things can be rearranged.
>>>>
>>>> Hopefully this gives you a little more insight.
>>>>
>>>> Regards, Dieter
>>>>
>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <sugandha....@gmail.com>:
>>>>
>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus, those many no. of times the map() function will be called right?
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <sugandha....@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> As per the various articles I went through till date, the file(s) are split in chunks/blocks. On the same note, would like to ask a few things:
>>>>>>
>>>>>> 1. No. of mappers are decided as: Total_File_Size/Max. Block Size. Thus, if the file is smaller than the block size, only one mapper will be invoked. Right?
>>>>>> 2. If yes, it means, the map() will be called only once. Right? In this case, if there are two datanodes with a replication factor as 1: only one datanode (mapper machine) will perform the task. Right?
>>>>>> 3. The map() function is called by all the datanodes/slaves right? If the no. of mappers are more than the no. of slaves, what happens?
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
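P.S. Mostly to check my own understanding of the task/slot arithmetic discussed above, with made-up numbers: a 1 GB input file with a 64 MB block size is stored as 16 HDFS blocks, so with a plain FileInputFormat it becomes 16 input splits and therefore 16 map tasks; within each task, map() is then called once per record of that split. If the cluster has, say, 4 tasktrackers with 2 map slots each, only 8 of those 16 tasks can run at the same time, so they run in 2 waves. Please correct me if I have got that wrong.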