Bejoy,

This is incorrect. As Denny explained earlier, blocks are split by byte size 
alone. The writer does not concern itself with newlines and such. When 
reading, each record reader aligns itself to line boundaries, reading into 
the next block if a line runs past the end of its own split.

This is explained neatly under http://wiki.apache.org/Hadoop/MapReduceArch, 
para 2 of Map.
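
To make the read-side alignment concrete, here is a minimal sketch in plain 
Python that simulates it on an in-memory byte string. It is not Hadoop code: 
the names split_ranges and read_split are made up for illustration, and the 
rule it assumes (every split except the first discards its first, possibly 
partial, line, and every split keeps reading until it has finished the line 
that crosses its end) is my summary of how the line record reader behaves.

```python
def split_ranges(total_size, block_size):
    """Cut [0, total_size) into byte ranges of block_size, ignoring newlines."""
    return [(start, min(block_size, total_size - start))
            for start in range(0, total_size, block_size)]


def read_split(data, start, length):
    """Return the lines owned by the split [start, start + length).

    Hypothetical helper mimicking a line-oriented record reader.
    """
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: discard everything up to and including the
        # first newline. The reader of the previous split owns that line.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    # A line belongs to this split if it starts at or before `end`, so the
    # last line may be read past `end`, i.e. out of the next block.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        records.append(data[pos:stop])
        pos = stop + 1
    return records


if __name__ == "__main__":
    text = b"alpha\nbravo\ncharlie\ndelta\n"
    for start, length in split_ranges(len(text), 8):
        print(start, length, read_split(text, start, length))
```

Even though the 8-byte ranges cut "bravo" and "charlie" in half, each line is 
returned whole by exactly one split.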

Regarding structured data, such as XML, you can write a custom InputFormat 
that scans through the entire file before job submission (say, by looking at 
tags) and returns appropriate split points.

However, if you want XML, then there is already an XMLInputFormat available in 
Mahout. For reading N lines at a time, use NLineInputFormat.
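
For the custom InputFormat route mentioned above, the pre-submit scan could 
look roughly like the sketch below. It is plain Python over an in-memory byte 
string, not the Hadoop API; tag_aligned_splits, the <r> tag, and the target 
size are all hypothetical. The assumption is that every record begins with a 
known start tag, so each split boundary is moved forward to the next 
occurrence of that tag.

```python
def tag_aligned_splits(data, start_tag, target_size):
    """Return (start, length) splits whose boundaries fall on start_tag.

    Hypothetical helper: a real job would do the equivalent work when the
    InputFormat computes its splits.
    """
    boundaries = [0]
    target = target_size
    while target < len(data):
        # Move the boundary forward to the next record start at or after
        # the byte offset where a size-based split would have cut.
        idx = data.find(start_tag, target)
        if idx == -1:
            break
        if idx > boundaries[-1]:
            boundaries.append(idx)
        target = idx + target_size
    boundaries.append(len(data))
    return [(s, e - s) for s, e in zip(boundaries, boundaries[1:]) if e > s]


if __name__ == "__main__":
    xml = b"<r>a</r><r>bb</r><r>ccc</r><r>d</r>"
    for start, length in tag_aligned_splits(xml, b"<r>", 10):
        print(start, length, xml[start:start + length])
```

The trade-off is that the whole file is scanned once before the job starts, 
in exchange for splits that never cut a record in two.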

On 11-Nov-2011, at 6:55 PM, bejoy.had...@gmail.com wrote:

> Donal
> In Hadoop that hardly ever happens. When you store data in HDFS, it is
> split into blocks along line endings, in the case of normal files. You
> won't have half of a line in one block and the rest in the next one, so you
> don't need to worry about that.
> The case you mentioned is one of dependent data splits. Hadoop's massively
> parallel processing can be fully utilized only with independent data
> splits. When the data splits depend on each other at the file level, as I
> pointed out, you can go for WholeFileInputFormat.
> 
> Please reply if you are still confused. Also, if you have a specific
> scenario, please share it so we can help you understand how MapReduce
> would process it.
> 
> Hope it clarifies...
> Regards
> Bejoy K S
> From: 臧冬松 <donal0...@gmail.com>
> Date: Fri, 11 Nov 2011 20:46:54 +0800
> To: <hdfs-user@hadoop.apache.org>
> ReplyTo: hdfs-user@hadoop.apache.org
> Subject: Re: structured data split
> 
> Thanks Bejoy!
> It's better to process the data blocks locally and separately.
> I just want to know how to deal with a structure (e.g. a word or a line)
> that is split across two blocks.
> 
> Cheers,
> Donal
> 
> On 11 November 2011 at 7:01 PM, Bejoy KS <bejoy.had...@gmail.com> wrote:
> Hi Donal
>       You can configure your map tasks to process your input the way you
> like. If you have a file of size 100 MB, it would be divided into two
> blocks and stored in HDFS (if your dfs.block.size is the default 64 MB). It
> is your choice how you process it using MapReduce:
> - With the default TextInputFormat, the two blocks would be processed by two
> different mappers (under default split settings). If the blocks are on two
> different data nodes, then in the best case one mapper would be spawned on
> each data node, i.e. they are data-local map tasks.
> - If you want one mapper to process the whole file, change your input format
> to WholeFileInputFormat. A map task would then be triggered on one of the
> nodes where the blocks are located (best case). If the two blocks are not
> on the same node, one of them would be transferred to the map task's
> location for processing.
> 
> Hope it helps!...
> 
> Thank You
> Bejoy.K.S
> 
> 
> 2011/11/11 臧冬松 <donal0...@gmail.com>
> Thanks Denny!
> So that means each map task will have to read from another DataNode in
> order to read the end of the last line of the previous block?
> 
> Cheers,
> Donal
> 
> 
> 2011/11/11 Denny Ye <denny...@gmail.com>
> hi
>    Structured data, such as a word or a line, can always end up split
> across different blocks.
>    A MapReduce task reads HDFS data in units of lines: it reads the whole
> line, from the end of the previous block into the start of the next one, to
> obtain the complete line record. So you do not need to worry about
> incomplete structured data. HDFS itself does nothing for this mechanism.
> 
> -Regards
> Denny Ye
> 
> 
> On Fri, Nov 11, 2011 at 3:43 PM, 臧冬松 <donal0...@gmail.com> wrote:
> Usually a large file in HDFS is split into chunks that are stored on
> different DataNodes.
> A map task is assigned to deal with each chunk. What happens if structured
> data (e.g. a word) is split across two chunks?
> How do MapReduce and HDFS deal with this?
> 
> Thanks!
> Donal
> 
> 
> 
> 
