Thanks Bejoy!
It's better to process the data blocks locally and separately.
I just want to know how to deal with a structure (e.g. a word or a line)
that is split across two blocks.

Cheers,
Donal

On Fri, Nov 11, 2011 at 7:01 PM, Bejoy KS <bejoy.had...@gmail.com> wrote:

> Hi Donal
>       You can configure your map tasks to process your input however you
> like. If you have a file of size 100 MB, it would be divided into two
> blocks and stored in HDFS (if your dfs.block.size is the default 64 MB).
> It is your choice how you process it with MapReduce:
> - With the default TextInputFormat (under default split settings), the two
> blocks would be processed by two different mappers. If the blocks are on
> two different data nodes then, in the best case, a mapper would be spawned
> on each of those data nodes, i.e. they are data-local map tasks.
> - If you want one mapper to process the whole file, change your input
> format to WholeFileInputFormat. Then a single map task would be triggered,
> in the best case on one of the nodes where a block is located. If both
> blocks are not on the same node, the other block would be transferred to
> the map task's node for processing.
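
WholeFileInputFormat is not a class that ships with Hadoop, so it has to be
written. Below is a minimal sketch of the non-splittable behaviour described
above, assuming the newer org.apache.hadoop.mapreduce API; the class name is
mine, and a full WholeFileInputFormat that emits one key/value pair per file
would also need a custom RecordReader, which is not shown here.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Keeps line-oriented records but refuses to split the file, so the whole
// file - all of its blocks - goes to a single map task.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // One InputSplit, and therefore one mapper, per file. Blocks that
        // are not local to that mapper are streamed over the network from
        // the DataNodes that hold them.
        return false;
    }
}

It would be selected in the driver with
job.setInputFormatClass(NonSplittableTextInputFormat.class).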
>
> Hope it helps!...
>
> Thank You
> Bejoy.K.S
>
>
> 2011/11/11 臧冬松 <donal0...@gmail.com>
>
>> Thanks Denny!
>> So that means each map task will have to read from another DataNode in
>> order to read the last line of the previous block?
>>
>> Cheers,
>> Donal
>>
>>
>> 2011/11/11 Denny Ye <denny...@gmail.com>
>>
>>> hi
>>>    Structured data, like a word or a line, can indeed end up split
>>> across two blocks.
>>>    MapReduce reads HDFS data in units of *lines*: the record reader will
>>> read the whole line, continuing from the end of the previous block into
>>> the start of the subsequent one, to obtain the complete line record. So
>>> you do not need to worry about incomplete structured data. HDFS itself
>>> does nothing special here; the handling is in the record reader.
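
A simplified sketch of that boundary rule, using a local file and plain Java
rather than the actual Hadoop LineRecordReader (the class and method names
here are illustrative, not Hadoop's): a reader for the byte range
[start, end) skips the partial line at its front unless it starts at byte 0,
and reads its last line to completion even when that line runs past end.

import java.io.IOException;
import java.io.RandomAccessFile;

public class SplitLineReader {
    // Print every line "owned" by the byte range [start, end) of the file.
    public static void readSplit(String path, long start, long end)
            throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            in.seek(start);
            if (start != 0) {
                // The partial (or whole) line at the front of this range is
                // the previous range's responsibility, so discard it.
                in.readLine();
            }
            String line;
            // A line belongs to this range if it starts no later than 'end';
            // the final line may run past 'end', mirroring a mapper reading
            // into the next HDFS block to finish its last record.
            while (in.getFilePointer() <= end && (line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // e.g. treat 64-byte chunks as "blocks": read the second chunk.
        readSplit(args[0], 64, 128);
    }
}

The two halves of the rule pair up: because the next reader always discards
its leading line, this reader must read one line past its end, otherwise a
record at the boundary could be skipped or counted twice.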
>>>
>>> -Regards
>>> Denny Ye
>>>
>>>
>>> On Fri, Nov 11, 2011 at 3:43 PM, 臧冬松 <donal0...@gmail.com> wrote:
>>>
>>>> Usually a large file in HDFS is split into blocks and stored on
>>>> different DataNodes.
>>>> A map task is assigned to deal with each block. I wonder what happens
>>>> if structured data (e.g. a word) is split across two blocks?
>>>> How do MapReduce and HDFS deal with this?
>>>>
>>>> Thanks!
>>>> Donal
>>>>
>>>
>>>
>>
>
