Sorry Bejoy, I'd typed that URL out from memory. The fixed link is: http://wiki.apache.org/hadoop/HadoopMapReduce
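To make what that page (and Denny's reply further down) describes a bit more
concrete: splits are cut purely on byte offsets, a non-first split throws away
everything up to its first newline, and every split keeps reading past its own
end until it finishes the line it is in, so each line lands in exactly one map
task. Here is a toy sketch of that rule against a plain local file -- this is
not Hadoop's actual LineRecordReader, and the class name and demo file are made
up; it is just the same idea in miniature:

import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Toy illustration (NOT Hadoop's LineRecordReader): splits are byte ranges,
 * a non-first split discards everything up to its first newline, and every
 * split keeps reading past its end offset until it finishes the line it is
 * in. Net effect: each line is read by exactly one split, even when the
 * line straddles a block boundary.
 */
public class SplitLineDemo {

    public static void readSplit(String file, long start, long length) throws IOException {
        long end = start + length;
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long pos = start;
            if (start != 0) {
                // Not the first split: back up one byte and discard up to and
                // including the next newline. Whatever we skip here is the tail
                // of a line that the previous split has already read in full.
                raf.seek(start - 1);
                pos = start - 1 + skipPastNewline(raf);
            }
            while (pos < end) {
                // readLine() happily runs past 'end' when the current line
                // spills into the next block -- the "read from the next block"
                // behaviour discussed in the thread below.
                String line = raf.readLine();
                if (line == null) {
                    break; // end of file
                }
                pos = raf.getFilePointer();
                System.out.println("split@" + start + " read: " + line);
            }
        }
    }

    private static long skipPastNewline(RandomAccessFile raf) throws IOException {
        long skipped = 0;
        int b;
        while ((b = raf.read()) != -1) {
            skipped++;
            if (b == '\n') {
                break;
            }
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        String file = args.length > 0 ? args[0] : "demo.txt";
        long splitSize = 16; // pretend the block/split size is 16 bytes
        long fileLen = new java.io.File(file).length();
        for (long off = 0; off < fileLen; off += splitSize) {
            readSplit(file, off, Math.min(splitSize, fileLen - off));
        }
    }
}

Run it over a small text file and you can see that a line straddling the
16-byte boundary is printed only by the split that started it.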
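On NLineInputFormat (mentioned in my reply below): if you want every mapper to
receive N complete lines instead of roughly one block, the change is only on
the driver side. A rough sketch with the new "mapreduce" API -- the class is
real, but treat the exact method for setting N as from memory and check it
against your release:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLinesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "n-lines-demo"); // Job.getInstance(conf, ...) on newer releases
        job.setJarByClass(NLinesDriver.class);

        // Each map task gets 1000 whole lines; a line is never cut in half.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1000);

        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // set your Mapper/Reducer and output types here as usual...

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}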
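And since WholeFileInputFormat comes up a couple of times in the thread: it
does not ship with Hadoop; it is a small custom class you write yourself (the
example in "Hadoop: The Definitive Guide" is the usual reference). A rough
sketch along those lines, again with the new API -- class names are my own:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** One split (and hence one mapper) per file; the file arrives as a single value. */
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, no matter how many blocks the file has
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    /** Emits exactly one record: (NullWritable, entire file contents). */
    static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Read every block of the file; blocks on other nodes are fetched
            // over the network, as described in the thread below.
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() {
            return NullWritable.get();
        }

        @Override
        public BytesWritable getCurrentValue() {
            return value;
        }

        @Override
        public float getProgress() {
            return processed ? 1.0f : 0.0f;
        }

        @Override
        public void close() {
            // nothing held open between calls
        }
    }
}

The key part is isSplitable() returning false: the whole file becomes one
split, so a single map task reads every block of it, pulling any remote blocks
over the network -- the trade-off Bejoy describes below.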
2011/11/11 Bejoy KS <bejoy.had...@gmail.com>:
> Thanks Harsh for correcting me with that wonderful piece of information.
> Cleared up a wrong assumption about HDFS storage fundamentals today.
>
> Sorry Donal for confusing you over the same.
>
> Harsh,
> Looks like the link is broken; it'd be great if you could post the URL
> once more.
>
> Thanks a lot
>
> Regards
> Bejoy.K.S
>
> On Fri, Nov 11, 2011 at 7:24 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Bejoy,
>> This is incorrect. As Denny had explained earlier, blocks are split along
>> byte sizes alone. The writer does not concern itself with newlines and
>> such. When reading, the record readers align themselves to read till the
>> end of lines by communicating with the next block if they have to.
>> This is explained neatly under
>> http://wiki.apache.org/Hadoop/MapReduceArch, para 2 of Map.
>> Regarding structured data such as XML, one can write a custom InputFormat
>> that returns appropriate split points after scanning through the entire
>> file pre-submit (say, by looking at tags).
>> However, if you want XML, there is already an XMLInputFormat available in
>> Mahout. For reading N lines at a time, use NLineInputFormat.
>>
>> On 11-Nov-2011, at 6:55 PM, bejoy.had...@gmail.com wrote:
>>
>> Donal
>> In Hadoop that hardly happens. When you store data in HDFS it is split
>> into blocks depending on ends of lines, in the case of normal files. It
>> won't be the case that half of a line sits in one block and the rest in
>> the next one. You don't need to worry about that.
>> The case you mentioned is one of dependent data splits. Hadoop's
>> massively parallel processing can be fully utilized only with independent
>> data splits. When data splits are dependent at a file level, as I pointed
>> out, you can go for WholeFileInputFormat.
>>
>> Please revert if you are still confused. Also, if you have some specific
>> scenario, please put that across so we may be able to help you understand
>> the map reduce processing of it better.
>>
>> Hope it clarifies...
>> Regards
>> Bejoy K S
>> ________________________________
>> From: 臧冬松 <donal0...@gmail.com>
>> Date: Fri, 11 Nov 2011 20:46:54 +0800
>> To: <hdfs-user@hadoop.apache.org>
>> ReplyTo: hdfs-user@hadoop.apache.org
>> Subject: Re: structured data split
>>
>> Thanks Bejoy!
>> It's better to process the data blocks locally and separately.
>> I just want to know how to deal with a structure (i.e. a word, a line)
>> that is split into two blocks.
>>
>> Cheers,
>> Donal
>>
>> On Fri, Nov 11, 2011 at 7:01 PM, Bejoy KS <bejoy.had...@gmail.com> wrote:
>>>
>>> Hi Donal
>>> You can configure your map tasks the way you like to process your input.
>>> If you have a file of size 100 MB, it would be divided into two input
>>> blocks and stored in HDFS (if your dfs.block.size is the default 64 MB).
>>> It is your choice how you process the same using map reduce.
>>> - With the default TextInputFormat the two blocks would be processed by
>>> two different mappers (under default split settings). If the blocks are
>>> on two different data nodes then, in the best case, two data-local map
>>> tasks would be spawned, one on each of those data nodes.
>>> - If you want one mapper to process the whole file, change your input
>>> format to WholeFileInputFormat. Then a single map task would be
>>> triggered on one of the nodes where the blocks are located (best case).
>>> If both blocks are not on the same node, one of the blocks would be
>>> transferred to the map task's location for processing.
>>>
>>> Hope it helps!...
>>>
>>> Thank You
>>> Bejoy.K.S
>>>
>>> 2011/11/11 臧冬松 <donal0...@gmail.com>
>>>>
>>>> Thanks Denny!
>>>> So that means each map task will have to read from another DataNode in
>>>> order to read the end line of the previous block?
>>>>
>>>> Cheers,
>>>> Donal
>>>>
>>>> 2011/11/11 Denny Ye <denny...@gmail.com>
>>>>>
>>>>> hi
>>>>> Structured data, like a word or a line, can always end up split across
>>>>> different blocks.
>>>>> A MapReduce task reads HDFS data in units of lines: it will read the
>>>>> whole line, from the end of the previous block into the start of the
>>>>> subsequent one, to obtain that part of the line record. So you do not
>>>>> need to worry about incomplete structured data. HDFS itself does
>>>>> nothing for this mechanism.
>>>>> -Regards
>>>>> Denny Ye
>>>>>
>>>>> On Fri, Nov 11, 2011 at 3:43 PM, 臧冬松 <donal0...@gmail.com> wrote:
>>>>>>
>>>>>> Usually a large file in HDFS is split into blocks and stored on
>>>>>> different DataNodes.
>>>>>> A map task is assigned to deal with one such block, so I wonder what
>>>>>> happens if structured data (i.e. a word) is split across two blocks?
>>>>>> How do MapReduce and HDFS deal with this?
>>>>>>
>>>>>> Thanks!
>>>>>> Donal
>>>>>
>>>>
>>>
>>
>>
>
>

--
Harsh J