Re: Reading data output by MapFileOutputFormat

Harsh J Mon, 23 Apr 2012 04:09:54 -0700

Ali,

MapFiles are explained at
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
- please give it a read; it should answer about half of your questions.
In short, a MapFile is a directory holding two files: "data", a plain
SequenceFile with the records sorted by key, and "index", a smaller
file built on top of it that maps a subset of the keys to their
positions in "data".


The reason MR does not provide a MapFileInputFormat is that MR jobs
don't need the index file: an input-driven job reads every record and
does no lookups. Hence SequenceFileInputFormat suffices to read the
data (it ignores the index file and reads only the sequence file that
carries the data).

If you wish to make use of MapFile's index abilities for lookups/etc.,
use the MapFile.Reader class directly in your implementation.
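
To make the split between the two files concrete, here is a toy,
in-memory sketch of the idea in plain Java. It is not Hadoop code and
not how MapFile is actually implemented: the class ToyMapFile, its
methods, and the index interval used are all made up for illustration.
It only shows why a full scan never needs the index, while a lookup
uses the index to jump near the key and then scans forward.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a MapFile: a sorted "data" list plus a sparse "index". Hypothetical, not Hadoop's implementation. */
final class ToyMapFile {
    // Stands in for the "data" file: key/value records in ascending key order.
    private final List<String[]> data = new ArrayList<>();
    // Stands in for the "index" file: every Nth key and its position in data.
    private final List<String> indexKeys = new ArrayList<>();
    private final List<Integer> indexPositions = new ArrayList<>();
    private final int indexInterval;

    ToyMapFile(int indexInterval) {
        if (indexInterval < 1) {
            throw new IllegalArgumentException("indexInterval must be >= 1");
        }
        this.indexInterval = indexInterval;
    }

    /** Appends a record; keys must arrive in ascending order. */
    void append(String key, String value) {
        if (!data.isEmpty() && data.get(data.size() - 1)[0].compareTo(key) > 0) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (data.size() % indexInterval == 0) {
            indexKeys.add(key);
            indexPositions.add(data.size());
        }
        data.add(new String[] {key, value});
    }

    /** Full scan, as an input-driven job would do: touches only data, never the index. */
    List<String> scanValues() {
        List<String> values = new ArrayList<>();
        for (String[] record : data) {
            values.add(record[1]);
        }
        return values;
    }

    /** Lookup: binary-search the index for the last indexed key <= key, then scan forward in data. */
    String get(String key) {
        int lo = 0;
        int hi = indexKeys.size() - 1;
        int start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid).compareTo(key) <= 0) {
                start = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null; // key sorts before the first record, or the file is empty
        }
        for (int i = indexPositions.get(start); i < data.size(); i++) {
            int cmp = data.get(i)[0].compareTo(key);
            if (cmp == 0) {
                return data.get(i)[1];
            }
            if (cmp > 0) {
                break; // passed the spot where the key would be
            }
        }
        return null;
    }

    /** Number of index entries, to show the index is smaller than the data. */
    int indexSize() {
        return indexKeys.size();
    }

    public static void main(String[] args) {
        ToyMapFile file = new ToyMapFile(2);
        file.append("a", "1");
        file.append("b", "2");
        file.append("c", "3");
        System.out.println("scan:   " + file.scanValues());
        System.out.println("lookup: " + file.get("b"));
    }
}
```

In the real thing, the scan side corresponds to what
SequenceFileInputFormat does with the "data" file, and the lookup side
to what MapFile.Reader does for you.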

On Mon, Apr 23, 2012 at 4:23 PM, Ali Safdar Kureishy
<safdar.kurei...@gmail.com> wrote:
> Hi,
>
> If I use a *MapFileOutputFormat* to output some data, I see that each
> reducer's output is a folder ("part-00000", for example), and inside that
> folder are two files: "data" and "index".
>
> However, there is no corresponding MapFileInputFormat, to read back this
> folder ("part-00000"). Instead, *SequenceFileInputFormat* seems to read the
> data. So, I have some questions:
> - does SequenceFileInputFormat actually read *all* the data that was output
> by MapFileOutputFormat? Or is some relationship data between the data and
> index files lost in this process that would have been better handled by
> another InputFormat class? In other words, is SequenceFileInputFormat the
> right InputFormat to read data written by MapFileOutputFormat?
> - how is it that SequenceFileInputFormat works to read outputs from
> *both* MapFileOutputFormat and SequenceFileOutputFormat? That would
> imply that
> MapFileOutputFormat and SequenceFileOutputFormat output the same data, OR
> that SequenceFileInputFormat internally handles both differently. What is
> the reality?
>
> Thanks,
> Safdar



-- 
Harsh J
