Sorry Bejoy, I'd typed that URL out from memory. The fixed link is: http://wiki.apache.org/hadoop/HadoopMapReduce
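To make what that page (and Denny's reply further down) describes a bit more
concrete: splits are cut purely on byte offsets, a non-first split throws away
everything up to its first newline, and every split keeps reading past its own
end until it finishes the line it is in, so each line lands in exactly one map
task. Here is a toy sketch of that rule against a plain local file -- this is
not Hadoop's actual LineRecordReader, and the class name and demo file are made
up; it is just the same idea in miniature:

import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Toy illustration (NOT Hadoop's LineRecordReader): splits are byte ranges,
 * a non-first split discards everything up to its first newline, and every
 * split keeps reading past its end offset until it finishes the line it is
 * in. Net effect: each line is read by exactly one split, even when the
 * line straddles a block boundary.
 */
public class SplitLineDemo {

    public static void readSplit(String file, long start, long length) throws IOException {
        long end = start + length;
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long pos = start;
            if (start != 0) {
                // Not the first split: back up one byte and discard up to and
                // including the next newline. Whatever we skip here is the tail
                // of a line that the previous split has already read in full.
                raf.seek(start - 1);
                pos = start - 1 + skipPastNewline(raf);
            }
            while (pos < end) {
                // readLine() happily runs past 'end' when the current line
                // spills into the next block -- the "read from the next block"
                // behaviour discussed in the thread below.
                String line = raf.readLine();
                if (line == null) {
                    break; // end of file
                }
                pos = raf.getFilePointer();
                System.out.println("split@" + start + " read: " + line);
            }
        }
    }

    private static long skipPastNewline(RandomAccessFile raf) throws IOException {
        long skipped = 0;
        int b;
        while ((b = raf.read()) != -1) {
            skipped++;
            if (b == '\n') {
                break;
            }
        }
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        String file = args.length > 0 ? args[0] : "demo.txt";
        long splitSize = 16; // pretend the block/split size is 16 bytes
        long fileLen = new java.io.File(file).length();
        for (long off = 0; off < fileLen; off += splitSize) {
            readSplit(file, off, Math.min(splitSize, fileLen - off));
        }
    }
}

Run it over a small text file and you can see that a line straddling the
16-byte boundary is printed only by the split that started it.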
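On NLineInputFormat (mentioned in my reply below): if you want every mapper to
receive N complete lines instead of roughly one block, the change is only on
the driver side. A rough sketch with the new "mapreduce" API -- the class is
real, but treat the exact method for setting N as from memory and check it
against your release:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLinesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "n-lines-demo"); // Job.getInstance(conf, ...) on newer releases
        job.setJarByClass(NLinesDriver.class);

        // Each map task gets 1000 whole lines; a line is never cut in half.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1000);

        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // set your Mapper/Reducer and output types here as usual...

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}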
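And since WholeFileInputFormat comes up a couple of times in the thread: it
does not ship with Hadoop; it is a small custom class you write yourself (the
example in "Hadoop: The Definitive Guide" is the usual reference). A rough
sketch along those lines, again with the new API -- class names are my own:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** One split (and hence one mapper) per file; the file arrives as a single value. */
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split, no matter how many blocks the file has
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    /** Emits exactly one record: (NullWritable, entire file contents). */
    static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Read every block of the file; blocks on other nodes are fetched
            // over the network, as described in the thread below.
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() {
            return NullWritable.get();
        }

        @Override
        public BytesWritable getCurrentValue() {
            return value;
        }

        @Override
        public float getProgress() {
            return processed ? 1.0f : 0.0f;
        }

        @Override
        public void close() {
            // nothing held open between calls
        }
    }
}

The key part is isSplitable() returning false: the whole file becomes one
split, so a single map task reads every block of it, pulling any remote blocks
over the network -- the trade-off Bejoy describes below.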
2011/11/11 Bejoy KS <bejoy.had...@gmail.com>:
> Thanks Harsh for correcting me with that wonderful piece of information.
> Cleared up a wrong assumption about HDFS storage fundamentals today.
>
> Sorry Donal for confusing you over the same.
>
> Harsh,
> Looks like the link is broken; it'd be great if you could post the URL
> once more.
>
> Thanks a lot
>
> Regards
> Bejoy.K.S
>
> On Fri, Nov 11, 2011 at 7:24 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Bejoy,
>> This is incorrect. As Denny had explained earlier, blocks are split along
>> byte sizes alone. The writer does not concern itself with newlines and
>> such. When reading, the record readers align themselves to read till the
>> end of lines by communicating with the next block if they have to.
>> This is explained neatly under
>> http://wiki.apache.org/Hadoop/MapReduceArch, para 2 of Map.
>> Regarding structured data such as XML, one can write a custom InputFormat
>> that returns appropriate split points after scanning through the entire
>> file pre-submit (say, by looking at tags).
>> However, if you want XML, there is already an XMLInputFormat available in
>> Mahout. For reading N lines at a time, use NLineInputFormat.
>>
>> On 11-Nov-2011, at 6:55 PM, bejoy.had...@gmail.com wrote:
>>
>> Donal
>> In Hadoop that hardly happens. When you store data in HDFS it is split
>> into blocks depending on ends of lines, in the case of normal files. It
>> won't be the case that half of a line sits in one block and the rest in
>> the next one. You don't need to worry about that.
>> The case you mentioned is one of dependent data splits. Hadoop's
>> massively parallel processing can be fully utilized only with independent
>> data splits. When data splits are dependent at a file level, as I pointed
>> out, you can go for WholeFileInputFormat.
>>
>> Please revert if you are still confused. Also, if you have some specific
>> scenario, please put that across so we may be able to help you understand
>> the map reduce processing of it better.
>>
>> Hope it clarifies...
>> Regards
>> Bejoy K S
>> ________________________________
>> From: 臧冬松 <donal0...@gmail.com>
>> Date: Fri, 11 Nov 2011 20:46:54 +0800
>> To: <hdfs-user@hadoop.apache.org>
>> ReplyTo: hdfs-user@hadoop.apache.org
>> Subject: Re: structured data split
>>
>> Thanks Bejoy!
>> It's better to process the data blocks locally and separately.
>> I just want to know how to deal with a structure (i.e. a word, a line)
>> that is split into two blocks.
>>
>> Cheers,
>> Donal
>>
>> On Fri, Nov 11, 2011 at 7:01 PM, Bejoy KS <bejoy.had...@gmail.com> wrote:
>>>
>>> Hi Donal
>>> You can configure your map tasks the way you like to process your input.
>>> If you have a file of size 100 MB, it would be divided into two input
>>> blocks and stored in HDFS (if your dfs.block.size is the default 64 MB).
>>> It is your choice how you process the same using map reduce.
>>> - With the default TextInputFormat the two blocks would be processed by
>>> two different mappers (under default split settings). If the blocks are
>>> on two different data nodes then, in the best case, two data-local map
>>> tasks would be spawned, one on each of those data nodes.
>>> - If you want one mapper to process the whole file, change your input
>>> format to WholeFileInputFormat. Then a single map task would be
>>> triggered on one of the nodes where the blocks are located (best case).
>>> If both blocks are not on the same node, one of the blocks would be
>>> transferred to the map task's location for processing.
>>>
>>> Hope it helps!...
>>>
>>> Thank You
>>> Bejoy.K.S
>>>
>>> 2011/11/11 臧冬松 <donal0...@gmail.com>
>>>>
>>>> Thanks Denny!
>>>> So that means each map task will have to read from another DataNode in
>>>> order to read the end line of the previous block?
>>>>
>>>> Cheers,
>>>> Donal
>>>>
>>>> 2011/11/11 Denny Ye <denny...@gmail.com>
>>>>>
>>>>> hi
>>>>> Structured data, like a word or a line, can always end up split across
>>>>> different blocks.
>>>>> A MapReduce task reads HDFS data in units of lines: it will read the
>>>>> whole line, from the end of the previous block into the start of the
>>>>> subsequent one, to obtain that part of the line record. So you do not
>>>>> need to worry about incomplete structured data. HDFS itself does
>>>>> nothing for this mechanism.
>>>>> -Regards
>>>>> Denny Ye
>>>>>
>>>>> On Fri, Nov 11, 2011 at 3:43 PM, 臧冬松 <donal0...@gmail.com> wrote:
>>>>>>
>>>>>> Usually a large file in HDFS is split into blocks and stored on
>>>>>> different DataNodes.
>>>>>> A map task is assigned to deal with one such block, so I wonder what
>>>>>> happens if structured data (i.e. a word) is split across two blocks?
>>>>>> How do MapReduce and HDFS deal with this?
>>>>>>
>>>>>> Thanks!
>>>>>> Donal
>>>>>
>>>>
>>>
>>
>>
>
>

--
Harsh J