>> SequenceFiles place sync markers (similar to what 'newlines' mean in
text files) after a bunch of records, and that is the reason why your
record does not split when read.

A sync marker is placed after every N records and is used for seeking from
an arbitrary position in a file to the start of the next record. A mapper
processes a SequenceFile block from the first sync in the current block to
the first sync in the next block of the file. This may involve some data
transfer from another node to the node where the task is running.
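
The split-reading rule above can be sketched with a toy record format. This is a simplified illustration, not the real SequenceFile layout: a real SequenceFile uses a random 16-byte sync UID per file (so the marker is unlikely to collide with record data), while the constants and helper names here are made up for the demonstration.

```python
import io

# Simplified stand-in for a SequenceFile: length-prefixed records with a
# sync marker written every SYNC_INTERVAL records. The marker value and
# interval are illustrative, not Hadoop's actual on-disk format.
SYNC = b"\x00\x00\x00\x00SYNC"
SYNC_INTERVAL = 3

def write_records(records):
    buf = io.BytesIO()
    for i, rec in enumerate(records):
        if i > 0 and i % SYNC_INTERVAL == 0:
            buf.write(SYNC)
        buf.write(len(rec).to_bytes(4, "big"))
        buf.write(rec)
    return buf.getvalue()

def sync_after(data, pos):
    """Offset of the first sync marker at or after pos (len(data) if none)."""
    idx = data.find(SYNC, pos)
    return len(data) if idx == -1 else idx

def read_split(data, start, end):
    """Read the records a mapper owns for the byte range [start, end):
    from the first sync at/after start up to the first sync at/after end."""
    pos = 0 if start == 0 else sync_after(data, start) + len(SYNC)
    stop = sync_after(data, end)
    out = []
    while pos < stop:
        if data[pos:pos + len(SYNC)] == SYNC:
            pos += len(SYNC)
            continue
        n = int.from_bytes(data[pos:pos + 4], "big")
        out.append(data[pos + 4:pos + 4 + n])
        pos += 4 + n
    return out

# Two adjacent splits together cover every record exactly once.
recs = [("rec%d" % i).encode() for i in range(7)]
data = write_records(recs)
first = read_split(data, 0, len(data) // 2)
second = read_split(data, len(data) // 2, len(data))
print(first + second == recs)  # True

# A record larger than a split is read whole by the mapper whose split
# contains its leading sync; the next mapper finds no further sync marker
# in its range and reads nothing.
big = [b"x" * 40]
d2 = write_records(big)
print(read_split(d2, 0, 20) == big)       # True
print(read_split(d2, 20, len(d2)) == [])  # True
```

This also illustrates why an oversized record is never split: the second mapper scans for a sync marker, finds none before the end of the file, and produces no records.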

What happens with a block size of 128 MB, a key larger than 128 MB, and a
particular block that doesn't contain a sync mark? Will the mapper see that
there is no sync mark in the block and do nothing, or is the block not
assigned to a mapper at all?

Regards,
Praveen

On Mon, Dec 5, 2011 at 10:47 AM, Harsh J <ha...@cloudera.com> wrote:

> Florin,
>
> Based on the SequenceFileInputFormat's splitting, you should see just
> one task reading the record. SequenceFiles place sync markers (similar
> to what 'newlines' mean in text files) after a bunch of records, and
> that is the reason why your record does not split when read.
>
> Also worth thinking about increasing block size for these files to fit
> their contents.
>
> On Thu, Oct 27, 2011 at 9:31 PM, Florin P <florinp...@yahoo.com> wrote:
> > Hello!
> >  Suppose this scenario:
> > 1. The DFS block 64MB
> > 2. We populate a SequenceFile with a binary value that has 200MB (that
> represents a PDF file)
> > In the circumstances of above scenario:
> > 1. How many blocks will be created on HDFS?
> > 2. Will the number of blocks be 200MB/64MB, approx. 4 blocks?
> > 3. How many task mappers will be created? Is it the same number as the
> > number of blocks?
> > 4. If 4 mappers are created, will one mapper process the single value
> > of the file while the other three are just created and stopped?
> >
> > I look forward to your answers.
> > Thank you.
> > Regards,
> >  Florin
> >
> >
>
>
>
> --
> Harsh J
>
