Hmm, IMHO running a mapper-only job will give you an output file in the same order. You should write a custom map-reduce job where the map emits (key: a random integer, e.g. Random.nextInt(), value: line) and the reducer outputs (key: NOTHING, value: line).
The framework's shuffle/sort on those random keys then gives you a random ordering of your input file.

Best,
Bhupesh

On 5/21/09 11:15 AM, "Alex Loddengaard" <a...@cloudera.com> wrote:

> Hi John,
>
> I don't know of a built-in way to do this. Depending on how well you want
> to randomize, you could just run a MapReduce job with at least one map (the
> more maps, the more random) and no reduces. When you run a job with no
> reduces, the shuffle phase is skipped entirely, and the intermediate outputs
> from the mappers are stored directly to HDFS. Though I think each mapper
> will create one HDFS file, so you'll have to concatenate all files into a
> single file.
>
> The above isn't a very good way to randomize, but it's fairly easy to
> implement and should run pretty quickly.
>
> Hope this helps.
>
> Alex
>
> On Thu, May 21, 2009 at 7:18 AM, John Clarke <clarke...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a need to randomize my input file before processing. I understand I
>> can chain Hadoop jobs together so the first could take the input file,
>> randomize it, and then the second could take the randomized file and do the
>> processing.
>>
>> The input file has one entry per line and I want to mix up the lines before
>> the main processing.
>>
>> Is there an inbuilt ability I have missed or will I have to try and write a
>> Hadoop program to shuffle my input file?
>>
>> Cheers,
>> John
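
For what it's worth, the random-key trick above can be sketched in plain Java outside Hadoop (a standalone illustration, not actual Mapper/Reducer code — the class and method names here are made up for the example): pair each line with a random key (what the map would emit), sort by that key (what Hadoop's shuffle/sort does for free), then emit only the lines (what the reducer would do).

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;

public class RandomizeLines {
    // Simulates the proposed job: "map" tags each line with a random int key,
    // the sort stands in for Hadoop's shuffle/sort on keys, and the final
    // projection stands in for a reducer that emits only the value.
    static List<String> randomize(List<String> lines, Random rnd) {
        return lines.stream()
                .map(line -> new AbstractMap.SimpleEntry<>(rnd.nextInt(), line))
                .sorted(Map.Entry.comparingByKey())   // shuffle/sort phase
                .map(Map.Entry::getValue)             // reducer drops the key
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("line1", "line2", "line3", "line4", "line5");
        // Fixed seed only so the demo is repeatable; a real job would not seed.
        System.out.println(randomize(input, new Random(42)));
    }
}
```

In a real Hadoop job the same effect falls out of the framework: because reducers receive their input sorted by key, random keys yield a randomly ordered output, and no custom sorting code is needed.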