Thanks, that's a clever approach. I'll implement something like this.

Cheers,
John
2009/5/21 Bhupesh Bansal <bban...@linkedin.com>

> Hmm,
>
> IMHO running a mapper-only job will give you an output file
> with the same order. You should write a custom map-reduce job
> where the map emits (key: Integer.random(), value: line)
> and the reducer outputs (key: NOTHING, value: line).
>
> Sorting on the random keys gives you a random ordering for your
> input file.
>
> Best
> Bhupesh
>
>
> On 5/21/09 11:15 AM, "Alex Loddengaard" <a...@cloudera.com> wrote:
>
> > Hi John,
> >
> > I don't know of a built-in way to do this. Depending on how well you want
> > to randomize, you could just run a MapReduce job with at least one map (the
> > more maps, the more random) and no reduces. When you run a job with no
> > reduces, the shuffle phase is skipped entirely, and the intermediate outputs
> > from the mappers are stored directly to HDFS. Though I think each mapper
> > will create one HDFS file, so you'll have to concatenate all the files into a
> > single file.
> >
> > The above isn't a very good way to randomize, but it's fairly easy to
> > implement and should run pretty quickly.
> >
> > Hope this helps.
> >
> > Alex
> >
> > On Thu, May 21, 2009 at 7:18 AM, John Clarke <clarke...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I have a need to randomize my input file before processing. I understand I
> >> can chain Hadoop jobs together, so the first could take the input file and
> >> randomize it, and then the second could take the randomized file and do the
> >> processing.
> >>
> >> The input file has one entry per line, and I want to mix up the lines before
> >> the main processing.
> >>
> >> Is there an inbuilt ability I have missed, or will I have to try and write a
> >> Hadoop program to shuffle my input file?
> >>
> >> Cheers,
> >> John
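[Editor's sketch] Bhupesh's technique can be simulated outside Hadoop as a plain, self-contained Java program: a "map" step pairs each line with a random integer key, a sort by key stands in for the framework's shuffle/sort, and a "reduce" step discards the key and emits only the line. Note that Java has no `Integer.random()` as written in the thread; `Random.nextInt()` plays that role here. This is an illustrative simulation, not the actual Hadoop job code (a real job would subclass `Mapper` and `Reducer`).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class RandomShuffleSketch {
    // "Map" phase: pair each input line with a random integer key.
    // (Java has no Integer.random(); Random.nextInt() is used instead.)
    static List<Map.Entry<Integer, String>> map(List<String> lines, Random rnd) {
        List<Map.Entry<Integer, String>> keyed = new ArrayList<>();
        for (String line : lines) {
            keyed.add(new AbstractMap.SimpleEntry<>(rnd.nextInt(), line));
        }
        return keyed;
    }

    // Shuffle/sort phase: Hadoop sorts intermediate pairs by key before
    // the reduce; sorting on random keys is what randomizes the order.
    static void sortByKey(List<Map.Entry<Integer, String>> keyed) {
        keyed.sort(Map.Entry.comparingByKey());
    }

    // "Reduce" phase: discard the key, emit only the line.
    static List<String> reduce(List<Map.Entry<Integer, String>> keyed) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> e : keyed) {
            out.add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("line1", "line2", "line3", "line4", "line5");
        Random rnd = new Random(42L); // seeded only so the demo is repeatable
        List<Map.Entry<Integer, String>> keyed = map(input, rnd);
        sortByKey(keyed);
        List<String> shuffled = reduce(keyed);
        // Same multiset of lines, in a (very likely) different order.
        System.out.println(shuffled);
    }
}
```

The key point the simulation illustrates: the reducer itself does no sorting; the framework's sort on the intermediate keys is what reorders the lines, so emitting random keys from the map is enough.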