Re: Splitting SequenceFile in controlled manner

Majid Azimi Tue, 06 Dec 2011 12:24:02 -0800

So if we have a map job analysing only the second block of the log file, it
should not need to transfer any other parts of the file from other nodes,
because that block is a standalone, meaningful split? Am I right?


On Tue, Dec 6, 2011 at 11:32 PM, Harsh J <ha...@cloudera.com> wrote:

> Majid,
>
> Sync markers are already written into sequence files; they are part of the
> format. This is nothing to worry about, and it is simple enough to test and
> be confident about. The mechanism is the same as reading a text file with
> newlines: the reader will read past the split boundary in order to
> complete a record if it has to.
>
> On 07-Dec-2011, at 1:25 AM, Majid Azimi wrote:
>
> > Hadoop writes a SequenceFile in key-value pair (record) format.
> > Consider a large unbounded log file. Hadoop will split the file
> > based on block size and store the blocks on multiple data nodes. Is it
> > guaranteed that each key-value pair will reside in a single block? Or may
> > we have a case where the key is in one block on node 1 and the value (or
> > part of it) is in a second block on node 2? If we can get such meaningless
> > splits, then what is the solution? Sync markers?
> >
> > Another question: does Hadoop write sync markers automatically, or
> > should we write them manually?
>
>
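
For anyone who wants to see the text-file analogy from Harsh's reply in
action, here is a minimal, self-contained sketch. It does not use any Hadoop
classes; the class name SplitBoundaryDemo and its methods are made up for
illustration, and the ownership rule (a record belongs to the split that
contains its first byte) is an assumption chosen to mimic the behaviour
described above, not Hadoop's actual reader code. Newlines stand in for sync
markers.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: newline-delimited records read per split.
public class SplitBoundaryDemo {

    // Return the records owned by the split [start, end) of data.
    // Assumed rule: a record is owned by the split holding its first byte.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // If the split starts mid-record, skip to the next record start;
        // the previous split's reader is responsible for that partial record.
        if (pos > 0 && data[pos - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        // Read every record that starts inside this split, running past
        // 'end' if that is needed to complete the last record.
        while (pos < end && pos < data.length) {
            int recEnd = pos;
            while (recEnd < data.length && data[recEnd] != '\n') {
                recEnd++;
            }
            records.add(new String(data, pos, recEnd - pos, StandardCharsets.UTF_8));
            pos = recEnd + 1;
        }
        return records;
    }

    // Cut data into fixed-size splits and read each one independently.
    static List<List<String>> readAllSplits(byte[] data, int splitSize) {
        List<List<String>> perSplit = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            perSplit.add(readSplit(data, start, end));
        }
        return perSplit;
    }

    public static void main(String[] args) {
        byte[] data = "alpha=1\nbeta=22\ngamma=333\n".getBytes(StandardCharsets.UTF_8);
        // A split size of 10 bytes cuts through the middle of records.
        System.out.println(readAllSplits(data, 10));
    }
}
```

With a 10-byte split size the boundaries fall inside records, yet each
record is still returned whole and exactly once. The cost is that a reader
may fetch a few bytes beyond its own split, which in HDFS can mean a small
remote read from the next block.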
