Majid,

Yes. Simply put, your records will never be broken. A reader does not stop 
exactly at the split boundary; it may read past the boundary until it hits a 
sync marker, so that it can complete the record (or series of records) it 
started. The mapper for the next split always skips ahead to its first sync 
marker before it begins reading, which avoids duplication. This is exactly how 
text file reading works as well -- only there, the boundary is a newline 
rather than a sync marker.
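
If it helps to see this concretely, here is a rough, untested sketch against 
the old SequenceFile API. The file path and split offset are made-up, purely 
for illustration: the writer inserts sync markers on its own, and a reader 
handed an arbitrary split offset simply calls sync() to jump forward to the 
next marker before it starts returning whole records.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SyncMarkerDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/demo.seq"); // made-up path, just for the demo

    // Writing: the writer places sync markers into the file by itself;
    // the application only appends records.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, LongWritable.class, Text.class);
    try {
      for (long i = 0; i < 100000; i++) {
        writer.append(new LongWritable(i), new Text("record-" + i));
      }
    } finally {
      writer.close();
    }

    // Reading, the way a split reader does: pick an arbitrary byte offset
    // (a pretend split boundary that may fall mid-record), then sync()
    // forward to the next marker so reading starts on a whole record.
    long splitStart = 4096; // made-up boundary
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      reader.sync(splitStart);
      LongWritable key = new LongWritable();
      Text value = new Text();
      if (reader.next(key, value)) {
        System.out.println("First whole record after " + splitStart
            + ": " + key + " => " + value);
      }
    } finally {
      reader.close();
    }
  }
}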

On 07-Dec-2011, at 1:53 AM, Majid Azimi wrote:

> So if a map task analyses only the second block of the log file, it should
> not need to transfer any other parts of the file from other nodes, because
> that block is a standalone and meaningful split? Am I right?
> 
> On Tue, Dec 6, 2011 at 11:32 PM, Harsh J <ha...@cloudera.com> wrote:
> 
>> Majid,
>> 
>> Sync markers are already written into sequence files; they are part of the
>> format. This is nothing to worry about - and it is simple enough to test and
>> be confident about. The mechanism is the same as reading a text file with
>> newlines - the reader will read past the boundary data in order to
>> complete a record if it has to.
>> 
>> On 07-Dec-2011, at 1:25 AM, Majid Azimi wrote:
>> 
>>> Hadoop writes a SequenceFile in key-value pair (record) format. Consider a
>>> large, unbounded log file. Hadoop will split the file based on block size
>>> and store the blocks on multiple data nodes. Is it guaranteed that each
>>> key-value pair resides within a single block? Or could the key end up in
>>> one block on node 1 while the value (or part of it) lands in a second
>>> block on node 2? If we can get meaningless splits like that, what is the
>>> solution? Sync markers?
>>> 
>>> Another question: does Hadoop write sync markers automatically, or should
>>> we write them manually?
>> 
>> 