> 1. Any reason not to use a sequence file for this?  Perhaps a mapfile?
>  Since I've sorted it, I don't need "random" accesses, but I do need
> to be aware of the keys, as I need to be sure that I get all of the
> relevant keys sent to a given mapper

MapFile *may* be better here (see my answer for 2 below).

> 2. Looks like I want a custom inputformat for this, extending
> SequenceFileInputFormat.  Do you agree?  I'll gladly take some
> opinions on this, as I ultimately want to split the file based on what's in
> the file, which might be a little unorthodox.

If you need to split based on where certain keys are in the file, then
a SequenceFile isn't a great fit: your InputFormat would have to scan
through all of the data just to find the split points. Assuming you
know which keys to split on ahead of time, you could use MapFiles
instead and use their key index to seek to each split point without a
full scan.
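
To illustrate why a key index helps, here is a minimal sketch in plain Java. It is not the Hadoop MapFile API; the class `SparseIndexSketch`, its methods, the keys, and the offsets are all hypothetical stand-ins. It only shows the general idea of keeping a sorted, sparse key-to-offset index so that a known split key can be located with a lookup instead of a scan.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a sorted key index; not Hadoop code.
public class SparseIndexSketch {
    // Maps a sampled key to the (assumed) byte offset of its record in the data file.
    private final TreeMap<String, Long> index = new TreeMap<>();

    public void addEntry(String key, long offset) {
        index.put(key, offset);
    }

    // Returns the offset of the last indexed key <= splitKey,
    // or 0 if splitKey sorts before every indexed key.
    // A reader would start here and skip forward a few records to the exact key.
    public long seekOffset(String splitKey) {
        Map.Entry<String, Long> entry = index.floorEntry(splitKey);
        return entry == null ? 0L : entry.getValue();
    }

    public static void main(String[] args) {
        SparseIndexSketch idx = new SparseIndexSketch();
        idx.addEntry("apple", 0L);
        idx.addEntry("kiwi", 4096L);
        idx.addEntry("plum", 8192L);
        System.out.println(idx.seekOffset("mango")); // prints 4096
    }
}
```

The lookup cost depends on the size of the index rather than the size of the data, which is the advantage over scanning a plain SequenceFile for split points.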

> 3. Another idea might be to create separate seq files for each chunk of
> records and make them non-splittable, ensuring that they go to a
> single mapper.  Assuming I can get away with this, see any pros/cons
> with that approach?

Separate sequence files would require the least amount of custom code:
an InputFormat whose isSplitable() returns false is enough to keep each
file on a single mapper.
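
If you go this route, the files have to be cut so that no key straddles two of them. A rough sketch of that chunking step in plain Java follows. The class `KeyAlignedChunks`, the `chunk` method, and the `targetSize` parameter are hypothetical names for illustration, and writing each chunk out as its own sequence file is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of Hadoop.
public class KeyAlignedChunks {
    // Splits a key-sorted list into chunks of at least targetSize records,
    // cutting only where the key changes so each key stays in one chunk.
    public static List<List<String>> chunk(List<String> sortedKeys, int targetSize) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String key : sortedKeys) {
            boolean keyChanged = !current.isEmpty()
                    && !current.get(current.size() - 1).equals(key);
            // Only start a new chunk once the target is reached AND the key changes.
            if (current.size() >= targetSize && keyChanged) {
                chunks.add(current);
                current = new ArrayList<>();
            }
            current.add(key);
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "a", "b", "b", "b", "c");
        System.out.println(chunk(keys, 2)); // prints [[a, a], [b, b, b], [c]]
    }
}
```

Chunks can run past the target size when one key has many records, so the resulting files may be uneven in size.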

-Joey

-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434
