Jason, correct me if I am wrong - opening a SequenceFile in the configure method (or the setup method in 0.20) and writing to it is the same as doing output.collect(), unless you mean I should make the SequenceFile writer a static variable and set the JVM reuse flag to -1. In that case the subsequent mappers might run in the same JVM and could use the same writer, and hence produce one file. But then I would need to add a hook to close the writer - maybe use a shutdown hook.
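Jason's per-task-writer pattern can be sketched without the Hadoop classes. This is a minimal, self-contained illustration of the lifecycle only - `RecordSink` is a hypothetical stand-in for `SequenceFile.Writer`, and the method names mirror the old-API `configure()`/`map()`/`close()` hooks (setup/cleanup in 0.20):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for SequenceFile.Writer: appends records until closed.
class RecordSink {
    final List<String> records = new ArrayList<String>();
    boolean closed = false;

    void append(String key, String value) {
        if (closed) throw new IllegalStateException("writer already closed");
        records.add(key + "\t" + value);
    }

    void close() { closed = true; }
}

public class DirectWriteMapper {
    private RecordSink sink;

    // corresponds to Mapper.configure() (setup() in the 0.20 API):
    // open one writer per map task
    public void configure() { sink = new RecordSink(); }

    // corresponds to map(): write straight to the task-local file,
    // bypassing output.collect()
    public void map(String key, String value) { sink.append(key, value); }

    // corresponds to close() (cleanup() in 0.20): close the writer so the
    // file is flushed exactly once per task
    public void close() { sink.close(); }

    public RecordSink sink() { return sink; }

    public static void main(String[] args) {
        DirectWriteMapper m = new DirectWriteMapper();
        m.configure();
        m.map("a.xml", "<doc>1</doc>");
        m.map("b.xml", "<doc>2</doc>");
        m.close();
        System.out.println(m.sink().records.size()); // 2
    }
}
```

With JVM reuse disabled each task gets its own writer and you still end up with one file per map, which is why the static-writer-plus-reuse idea only consolidates files when tasks actually share a JVM.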
Jothi, the idea of a combine input format is good, but I guess I will have to write something of my own to make it work in my case.

Thanks guys for the suggestions... but I feel we should have some support from the framework to merge the output of a map-only job so that we don't end up with a large number of small files. Sometimes you just don't want to run reducers and unnecessarily transfer a whole lot of data across the network.

Thanks,
Tarandeep

On Wed, Jun 17, 2009 at 7:57 PM, jason hadoop <jason.had...@gmail.com> wrote:
> You can open your sequence file in the mapper configure method, write to it
> in your map, and close it in the mapper close method. Then you end up with
> one sequence file per map. I am making the assumption that each key,value
> to your map somehow represents a single xml file/item.
>
> On Wed, Jun 17, 2009 at 7:29 PM, Jothi Padmanabhan <joth...@yahoo-inc.com> wrote:
>
> > You could look at CombineFileInputFormat to generate a single split out
> > of several files.
> >
> > Your partitioner would be able to assign keys to specific reducers, but
> > you would not have control over which node a given reduce task will run on.
> >
> > Jothi
> >
> > On 6/18/09 5:10 AM, "Tarandeep Singh" <tarand...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Can I restrict the output of mappers running on a node to go to
> > > reducer(s) running on the same node?
> > >
> > > Let me explain why I want to do this-
> > >
> > > I am converting a huge number of XML files into SequenceFiles. So
> > > theoretically I don't even need reducers; mappers would read xml files
> > > and output SequenceFiles. But the problem with this approach is I will
> > > end up getting a huge number of small output files.
> > >
> > > To avoid generating a large number of small files, I can run Identity
> > > reducers. But by running reducers, I am unnecessarily transferring data
> > > over the network. I ran a test case using a small subset of my data (~90GB).
> > > With map-only jobs, my cluster finished the conversion in only 6
> > > minutes, but with map and Identity reducers the job takes around 38
> > > minutes.
> > >
> > > I have to process close to a terabyte of data, so I was thinking of
> > > faster alternatives-
> > >
> > > * Writing a custom OutputFormat
> > > * Somehow restricting the output of mappers running on a node to go to
> > > reducers running on the same node. Maybe I can write my own partitioner
> > > (simple), but I am not sure how Hadoop's framework assigns partitions
> > > to reduce tasks.
> > >
> > > Any pointers?
> > >
> > > Or is this not possible at all?
> > >
> > > Thanks,
> > > Tarandeep
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.amazon.com/dp/1430219424?tag=jewlerymall
> www.prohadoopbook.com a community for Hadoop Professionals
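On the "not sure how Hadoop assigns partitions to reduce tasks" question: the partitioner returns a number in [0, numReduceTasks), and partition i is consumed by reduce task i - but which node runs task i is the scheduler's decision, which is why a custom partitioner alone cannot keep map output node-local. The sketch below is a self-contained demo of the same formula Hadoop's default HashPartitioner uses; it is not Hadoop code itself:

```java
// Demo of Hadoop's default HashPartitioner formula: mask off the sign bit so
// negative hashCodes still yield a valid partition index, then take the
// remainder modulo the number of reduce tasks. Partition i always goes to
// reduce task i; node placement of that task is up to the scheduler.
public class HashPartitionDemo {
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"file-001.xml", "file-002.xml", "file-003.xml"};
        int reducers = 4;
        for (String k : keys) {
            int p = getPartition(k, reducers);
            // every key lands in a deterministic partition in [0, reducers)
            assert p >= 0 && p < reducers;
            System.out.println(k + " -> partition " + p);
        }
    }
}
```

The mapping from key to partition is fully under your control, so steering keys to specific reducers is easy; steering reducers to specific nodes is not exposed at this layer.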