Thanks Harsh!...

2011/11/11 Harsh J <ha...@cloudera.com>:
> Sorry Bejoy, I'd typed that URL out from what I remembered off the top of
> my head. Fixed link is: http://wiki.apache.org/hadoop/HadoopMapReduce
>
> 2011/11/11 Bejoy KS <bejoy.had...@gmail.com>:
> > Thanks Harsh for correcting me with that wonderful piece of information.
> > Cleared a wrong assumption on hdfs storage fundamentals today.
> >
> > Sorry Donal for confusing you over the same.
> >
> > Harsh,
> > Looks like the link is broken, it'd be great if you could post the
> > URL once more.
> >
> > Thanks a lot
> >
> > Regards
> > Bejoy.K.S
> >
> > On Fri, Nov 11, 2011 at 7:24 PM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> Bejoy,
> >> This is incorrect. As Denny had explained earlier, blocks are split
> >> along byte sizes alone. The writer does not concern itself with
> >> newlines and such.
> >> When reading, the record readers align themselves to read till the end
> >> of lines by communicating with the next block if they have to.
> >> This is explained neatly under
> >> http://wiki.apache.org/Hadoop/MapReduceArch, para 2 of Map.
> >> Regarding structured data, such as XML, one can write a custom
> >> InputFormat that returns appropriate split points after scanning
> >> through the entire file pre-submit (say, by looking at tags).
> >> However, if you want XML, then there is already an XMLInputFormat
> >> available in Mahout. For reading N lines at a time, use
> >> NLineInputFormat.
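Since that record-reader behaviour is exactly the assumption that got cleared
up above, adding a small inline illustration for the archives. This is only a
simplified sketch of the idea in plain Java, not Hadoop's actual
LineRecordReader; the class name and byte ranges are made up. A reader whose
split does not start at byte 0 skips the first, possibly partial, line
(the previous split's reader owns it) and keeps reading past its own end
offset until the last line it started is complete.

import java.io.IOException;
import java.io.RandomAccessFile;

// Simplified illustration (NOT Hadoop's LineRecordReader) of how a
// line-oriented reader honours byte-based split boundaries.
public class SplitLineReaderSketch {

    public static void readSplit(String path, long start, long end) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            in.seek(start);
            long pos = start;

            // Skip the partial first line; the previous split's reader reads it.
            if (start != 0) {
                in.readLine();
                pos = in.getFilePointer();
            }

            // Emit whole lines; the last one may extend past 'end', which is fine.
            while (pos < end) {
                String line = in.readLine();
                if (line == null) {
                    break;                      // end of file
                }
                pos = in.getFilePointer();      // may now be beyond 'end'
                System.out.println("record: " + line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. treat bytes [64, 128) of a local file as "block 2"
        readSplit(args[0], 64, 128);
    }
}

The real reader also has to cope with compression and configurable record
delimiters, but that start/end handshake is what makes purely byte-based
block boundaries safe for line-oriented records.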
> >> On 11-Nov-2011, at 6:55 PM, bejoy.had...@gmail.com wrote:
> >>
> >> Donal
> >> In Hadoop that hardly happens. When you are storing data in hdfs it
> >> would be split into blocks depending on end of lines, in case of normal
> >> files. It won't be like you'd have half of a line in one block and the
> >> rest in the next one. You don't need to worry about that.
> >> The case you mentioned is like dependent data splits. Hadoop's massive
> >> parallel processing could be fully utilized only in case of independent
> >> data splits. When data splits are dependent on a file level, as I
> >> pointed out, you can go for WholeFileInputFormat.
> >>
> >> Please revert if you are still confused. Also if you have some specific
> >> scenario, please put that across so we may be able to help you
> >> understand the map reduce processing of the same better.
> >>
> >> Hope it clarifies...
> >> Regards
> >> Bejoy K S
> >> ________________________________
> >> From: 臧冬松 <donal0...@gmail.com>
> >> Date: Fri, 11 Nov 2011 20:46:54 +0800
> >> To: <hdfs-user@hadoop.apache.org>
> >> ReplyTo: hdfs-user@hadoop.apache.org
> >> Subject: Re: structured data split
> >>
> >> Thanks Bejoy!
> >> It's better to process the data blocks locally and separately.
> >> I just want to know how to deal with a structure (i.e. a word, a line)
> >> that is split into two blocks.
> >>
> >> Cheers,
> >> Donal
> >>
> >> On Fri, Nov 11, 2011 at 7:01 PM, Bejoy KS <bejoy.had...@gmail.com> wrote:
> >>>
> >>> Hi Donal
> >>> You can configure your map tasks the way you like to process your
> >>> input. If you have a file of size 100 MB, it would be divided into two
> >>> input blocks and stored in hdfs (if your dfs.block.size is the default
> >>> 64 MB). It is your choice how you process the same using map reduce.
> >>> - With the default TextInputFormat the two blocks would be processed
> >>> by two different mappers (under default split settings). If the blocks
> >>> are on two different data nodes then, in the best case, two different
> >>> mappers would be spawned, one on each data node, ie they are data-local
> >>> map tasks.
> >>> - If you want one mapper to process the whole file, change your input
> >>> format to WholeFileInputFormat. There a map task would be triggered on
> >>> any one of the nodes where the blocks are located (best case). If both
> >>> the blocks are not on the same node then one of the blocks would be
> >>> transferred to the map task's location for processing.
> >>>
> >>> Hope it helps!...
> >>>
> >>> Thank You
> >>> Bejoy.K.S
> >>>
> >>> 2011/11/11 臧冬松 <donal0...@gmail.com>:
> >>>>
> >>>> Thanks Denny!
> >>>> So that means each map task will have to read from another DataNode
> >>>> in order to read the end line of the previous block?
> >>>>
> >>>> Cheers,
> >>>> Donal
> >>>>
> >>>> 2011/11/11 Denny Ye <denny...@gmail.com>:
> >>>>>
> >>>>> hi
> >>>>> Structured data is always being split into different blocks, like a
> >>>>> word or a line.
> >>>>> A MapReduce task reads HDFS data with the unit of a line - it will
> >>>>> read the whole line, from the end of the previous block to the start
> >>>>> of the subsequent one, to obtain that partial line record. So you do
> >>>>> not need to worry about incomplete structured data. HDFS does nothing
> >>>>> for this mechanism.
> >>>>> -Regards
> >>>>> Denny Ye
> >>>>>
> >>>>> On Fri, Nov 11, 2011 at 3:43 PM, 臧冬松 <donal0...@gmail.com> wrote:
> >>>>>>
> >>>>>> Usually a large file in HDFS is split into blocks and stored in
> >>>>>> different DataNodes.
> >>>>>> A map task is assigned to deal with that block; I wonder what
> >>>>>> happens if structured data (i.e. a word) is split into two blocks?
> >>>>>> How do MapReduce and HDFS deal with this?
> >>>>>>
> >>>>>> Thanks!
> >>>>>> Donal
>
> --
> Harsh J
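One more note for the archives on WholeFileInputFormat, since it came up a
couple of times above: it is not a class that ships with Hadoop, you write it
yourself. A rough sketch against the newer org.apache.hadoop.mapreduce API
could look like the below (class names are just illustrative and error
handling is trimmed). Returning false from isSplitable() is what forces a
single map task for the whole file; its record reader then streams any
non-local blocks over the network.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Whole file as a single record: key = nothing, value = the file's bytes.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split: one map task gets the whole file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Read the entire file into one value, pulling remote blocks if needed.
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { /* nothing to close */ }
    }
}

In the driver you would then call job.setInputFormatClass(WholeFileInputFormat.class)
and write the mapper against <NullWritable, BytesWritable> input pairs.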