If the file is pre-sorted, why not just write multiple sequence files, one for each split?
Then you don't have to compute InputSplits, because the physical files are already split.

On Tue, Sep 11, 2012 at 11:00 PM, Harsh J <ha...@cloudera.com> wrote:
> Hey Jason,
>
> Is the file pre-sorted? You could override the InputFormat's
> #getSplits method to return InputSplits at identified key boundaries,
> as one solution - this would require reading the file up-front (at
> submit-time) and building the input splits out of it.
>
> On Wed, Sep 12, 2012 at 8:45 AM, Jason Yang <lin.yang.ja...@gmail.com> wrote:
>> Hi,
>>
>> I have a sequence file written by SequenceFileOutputFormat with key/value
>> type of <Text, BytesWritable>, like below:
>>
>> Text        BytesWritable
>> -------------------------------------------------------------
>> id_A_01     7F2B3C687F2B3C687F2B3C68
>> id_A_02     2F2B3C687F2B3C687F2B3C686AB23C68D73C68D7
>> id_A_03     5F2B3C68D77F2B3C687F2B3A
>> ...
>> id_B_01     1AB23C68D73C68D76AB23C68D73C68D7
>> id_B_02     5AB23C68D73C68D76AB68D76A1
>> id_B_03     F2B23C68D7B23C68D7B23C68D7
>>
>> If I want all the records with the same key prefix to be processed by the
>> same mapper, say records with key id_A_XX are processed by one mapper and
>> records with key id_B_XX are processed by another mapper, what should I do?
>>
>> Should I implement our own InputFormat inherited from
>> SequenceFileInputFormat?
>>
>> Any help would be appreciated.
>> --
>> YANG, Lin
>
> --
> Harsh J
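
To make the one-file-per-prefix suggestion concrete, here is a rough sketch of just the grouping step in plain Java, with no Hadoop dependencies. The class PrefixBuckets and both of its methods are hypothetical names made up for illustration, and the rule "the prefix is everything before the last underscore" is an assumption based on the sample keys (id_A_01 maps to id_A). In a real job each bucket would be written to its own sequence file instead of being held in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: groups record keys by prefix.
class PrefixBuckets {

    // Assumption: the prefix is everything before the last underscore,
    // so "id_A_01" maps to "id_A". A key without an underscore is its own prefix.
    static String prefixOf(String key) {
        int cut = key.lastIndexOf('_');
        return cut < 0 ? key : key.substring(0, cut);
    }

    // Groups keys by prefix, keeping prefixes in first-seen order.
    // Each bucket stands in for one output file, and so for one mapper.
    static Map<String, List<String>> bucketByPrefix(List<String> keys) {
        Map<String, List<String>> buckets = new LinkedHashMap<>();
        for (String key : keys) {
            buckets.computeIfAbsent(prefixOf(key), p -> new ArrayList<>()).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "id_A_01", "id_A_02", "id_A_03",
                "id_B_01", "id_B_02", "id_B_03");
        // Print one line per bucket: the prefix and the keys that share it.
        bucketByPrefix(keys).forEach((prefix, members) ->
                System.out.println(prefix + " -> " + members));
    }
}
```

One caveat, as far as I know: a file larger than the block size can still be divided into several splits, so you would probably also want an input format whose isSplitable method returns false, so that each file goes to exactly one mapper.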