Thanks Harsh!...

2011/11/11 Harsh J <ha...@cloudera.com>:
> Sorry Bejoy, I'd typed that URL out from what I remembered off the top of
> my head. Fixed link is: http://wiki.apache.org/hadoop/HadoopMapReduce
>
> 2011/11/11 Bejoy KS <bejoy.had...@gmail.com>:
> > Thanks Harsh for correcting me with that wonderful piece of information.
> > Cleared a wrong assumption on hdfs storage fundamentals today.
> >
> > Sorry Donal for confusing you over the same.
> >
> > Harsh,
> > Looks like the link is broken, it'd be great if you could post the
> > URL once more.
> >
> > Thanks a lot
> >
> > Regards
> > Bejoy.K.S
> >
> > On Fri, Nov 11, 2011 at 7:24 PM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> Bejoy,
> >> This is incorrect. As Denny had explained earlier, blocks are split
> >> along byte sizes alone. The writer does not concern itself with
> >> newlines and such.
> >> When reading, the record readers align themselves to read till the end
> >> of lines by communicating with the next block if they have to.
> >> This is explained neatly under
> >> http://wiki.apache.org/Hadoop/MapReduceArch, para 2 of Map.
> >> Regarding structured data, such as XML, one can write a custom
> >> InputFormat that returns appropriate split points after scanning
> >> through the entire file pre-submit (say, by looking at tags).
> >> However, if you want XML, then there is already an XMLInputFormat
> >> available in Mahout. For reading N lines at a time, use
> >> NLineInputFormat.
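Since that record-reader behaviour is exactly the assumption that got cleared
up above, adding a small inline illustration for the archives. This is only a
simplified sketch of the idea in plain Java, not Hadoop's actual
LineRecordReader; the class name and byte ranges are made up. A reader whose
split does not start at byte 0 skips the first, possibly partial, line
(the previous split's reader owns it) and keeps reading past its own end
offset until the last line it started is complete.

import java.io.IOException;
import java.io.RandomAccessFile;

// Simplified illustration (NOT Hadoop's LineRecordReader) of how a
// line-oriented reader honours byte-based split boundaries.
public class SplitLineReaderSketch {

    public static void readSplit(String path, long start, long end) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            in.seek(start);
            long pos = start;

            // Skip the partial first line; the previous split's reader reads it.
            if (start != 0) {
                in.readLine();
                pos = in.getFilePointer();
            }

            // Emit whole lines; the last one may extend past 'end', which is fine.
            while (pos < end) {
                String line = in.readLine();
                if (line == null) {
                    break;                      // end of file
                }
                pos = in.getFilePointer();      // may now be beyond 'end'
                System.out.println("record: " + line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. treat bytes [64, 128) of a local file as "block 2"
        readSplit(args[0], 64, 128);
    }
}

The real reader also has to cope with compression and configurable record
delimiters, but that start/end handshake is what makes purely byte-based
block boundaries safe for line-oriented records.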
> >> On 11-Nov-2011, at 6:55 PM, bejoy.had...@gmail.com wrote:
> >>
> >> Donal
> >> In Hadoop that hardly happens. When you are storing data in hdfs it
> >> would be split into blocks depending on end of lines, in case of normal
> >> files. It won't be like you'd have half of a line in one block and the
> >> rest in the next one. You don't need to worry about that.
> >> The case you mentioned is like dependent data splits. Hadoop's massive
> >> parallel processing could be fully utilized only in case of independent
> >> data splits. When data splits are dependent on a file level, as I
> >> pointed out, you can go for WholeFileInputFormat.
> >>
> >> Please revert if you are still confused. Also if you have some specific
> >> scenario, please put that across so we may be able to help you
> >> understand the map reduce processing of the same better.
> >>
> >> Hope it clarifies...
> >> Regards
> >> Bejoy K S
> >> ________________________________
> >> From: 臧冬松 <donal0...@gmail.com>
> >> Date: Fri, 11 Nov 2011 20:46:54 +0800
> >> To: <hdfs-user@hadoop.apache.org>
> >> ReplyTo: hdfs-user@hadoop.apache.org
> >> Subject: Re: structured data split
> >>
> >> Thanks Bejoy!
> >> It's better to process the data blocks locally and separately.
> >> I just want to know how to deal with a structure (i.e. a word, a line)
> >> that is split into two blocks.
> >>
> >> Cheers,
> >> Donal
> >>
> >> On Fri, Nov 11, 2011 at 7:01 PM, Bejoy KS <bejoy.had...@gmail.com> wrote:
> >>>
> >>> Hi Donal
> >>> You can configure your map tasks the way you like to process your
> >>> input. If you have a file of size 100 MB, it would be divided into two
> >>> input blocks and stored in hdfs (if your dfs.block.size is the default
> >>> 64 MB). It is your choice how you process the same using map reduce.
> >>> - With the default TextInputFormat the two blocks would be processed
> >>> by two different mappers (under default split settings). If the blocks
> >>> are on two different data nodes then, in the best case, two different
> >>> mappers would be spawned, one on each data node, ie they are data-local
> >>> map tasks.
> >>> - If you want one mapper to process the whole file, change your input
> >>> format to WholeFileInputFormat. There a map task would be triggered on
> >>> any one of the nodes where the blocks are located (best case). If both
> >>> the blocks are not on the same node then one of the blocks would be
> >>> transferred to the map task's location for processing.
> >>>
> >>> Hope it helps!...
> >>>
> >>> Thank You
> >>> Bejoy.K.S
> >>>
> >>> 2011/11/11 臧冬松 <donal0...@gmail.com>:
> >>>>
> >>>> Thanks Denny!
> >>>> So that means each map task will have to read from another DataNode
> >>>> in order to read the end line of the previous block?
> >>>>
> >>>> Cheers,
> >>>> Donal
> >>>>
> >>>> 2011/11/11 Denny Ye <denny...@gmail.com>:
> >>>>>
> >>>>> hi
> >>>>> Structured data is always being split into different blocks, like a
> >>>>> word or a line.
> >>>>> A MapReduce task reads HDFS data with the unit of a line - it will
> >>>>> read the whole line, from the end of the previous block to the start
> >>>>> of the subsequent one, to obtain that partial line record. So you do
> >>>>> not need to worry about incomplete structured data. HDFS does nothing
> >>>>> for this mechanism.
> >>>>> -Regards
> >>>>> Denny Ye
> >>>>>
> >>>>> On Fri, Nov 11, 2011 at 3:43 PM, 臧冬松 <donal0...@gmail.com> wrote:
> >>>>>>
> >>>>>> Usually a large file in HDFS is split into blocks and stored in
> >>>>>> different DataNodes.
> >>>>>> A map task is assigned to deal with that block; I wonder what
> >>>>>> happens if structured data (i.e. a word) is split into two blocks?
> >>>>>> How do MapReduce and HDFS deal with this?
> >>>>>>
> >>>>>> Thanks!
> >>>>>> Donal
>
> --
> Harsh J
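One more note for the archives on WholeFileInputFormat, since it came up a
couple of times above: it is not a class that ships with Hadoop, you write it
yourself. A rough sketch against the newer org.apache.hadoop.mapreduce API
could look like the below (class names are just illustrative and error
handling is trimmed). Returning false from isSplitable() is what forces a
single map task for the whole file; its record reader then streams any
non-local blocks over the network.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Whole file as a single record: key = nothing, value = the file's bytes.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split: one map task gets the whole file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Read the entire file into one value, pulling remote blocks if needed.
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { /* nothing to close */ }
    }
}

In the driver you would then call job.setInputFormatClass(WholeFileInputFormat.class)
and write the mapper against <NullWritable, BytesWritable> input pairs.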