Hey Jason,

Is the file pre-sorted? If so, one solution is to override the InputFormat's #getSplits method (for example in a subclass of SequenceFileInputFormat) to return InputSplits at the identified key boundaries. This would require reading the file up front (at submit time) and building the input splits from it.
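As a rough sketch of the boundary-finding step only, assuming the file is sorted by key and that the byte offset of each record has already been collected at submit time (for instance by scanning with SequenceFile.Reader and noting getPosition() before each record): the class PrefixSplitPlanner, its Range type, and the "everything before the last underscore" prefix rule are all hypothetical names and assumptions made for illustration, not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Hadoop class: groups sorted records into
// one byte range per key prefix.
public class PrefixSplitPlanner {

    /** Byte range [start, start + length) holding every record of one prefix. */
    public static final class Range {
        public final String prefix;
        public final long start;
        public final long length;

        Range(String prefix, long start, long length) {
            this.prefix = prefix;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return prefix + " @" + start + " +" + length;
        }
    }

    /** Assumed prefix rule: "id_A_01" -> "id_A" (text before the last '_'). */
    static String prefixOf(String key) {
        int i = key.lastIndexOf('_');
        return i < 0 ? key : key.substring(0, i);
    }

    /**
     * keys and offsets are parallel lists in file order; offsets.get(i) is the
     * byte position where record i starts. fileLength closes the last range.
     */
    public static List<Range> plan(List<String> keys, List<Long> offsets, long fileLength) {
        List<Range> ranges = new ArrayList<>();
        if (keys.isEmpty()) {
            return ranges;
        }
        String current = prefixOf(keys.get(0));
        long start = offsets.get(0);
        for (int i = 1; i < keys.size(); i++) {
            String p = prefixOf(keys.get(i));
            if (!p.equals(current)) {
                // Prefix changed: close the previous range at this record's offset.
                ranges.add(new Range(current, start, offsets.get(i) - start));
                current = p;
                start = offsets.get(i);
            }
        }
        ranges.add(new Range(current, start, fileLength - start));
        return ranges;
    }

    public static void main(String[] args) {
        // Made-up offsets, purely to exercise the grouping logic.
        List<String> keys = Arrays.asList(
                "id_A_01", "id_A_02", "id_A_03", "id_B_01", "id_B_02", "id_B_03");
        List<Long> offsets = Arrays.asList(100L, 140L, 190L, 230L, 270L, 310L);
        for (Range r : plan(keys, offsets, 350L)) {
            System.out.println(r);
        }
    }
}
```

In an overridden getSplits, each Range could then be wrapped in a FileSplit(path, start, length, hosts). One caveat, as far as I know: the stock SequenceFileRecordReader aligns a split's start to the next sync marker, so honoring exact record boundaries would probably also need a custom RecordReader that seeks to the given offset.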

On Wed, Sep 12, 2012 at 8:45 AM, Jason Yang <lin.yang.ja...@gmail.com> wrote:
> Hi,
>
> I have a sequence file written by SequenceFileOutputFormat with key/value
> type of <Text, BytesWritable>, like below:
>
> Text        BytesWritable
> -------------------------------------------------------------
> id_A_01     7F2B3C687F2B3C687F2B3C68
> id_A_02     2F2B3C687F2B3C687F2B3C686AB23C68D73C68D7
> id_A_03     5F2B3C68D77F2B3C687F2B3A
> ...
> id_B_01     1AB23C68D73C68D76AB23C68D73C68D7
> id_B_02     5AB23C68D73C68D76AB68D76A1
> id_B_03     F2B23C68D7B23C68D7B23C68D7
>
> If I want all the records with the same key prefix to be processed by a same
> mapper, say records with key id_A_XX are processed by a mapper and records
> with key id_B_XX are processed by another mapper, what should I do?
>
> Should I implement our own InputFormat inherited from
> SequenceFileInputFormat ?
>
> Any help would be appreciated.
> --
> YANG, Lin
>

--
Harsh J