Hi John,

I don't know of a built-in way to do this. Depending on how well you want to randomize, you could just run a MapReduce job with at least one map (the more maps, the more random) and no reduces. When you run a job with no reduces, the shuffle phase is skipped entirely, and each mapper's output is written directly to HDFS as the job's final output. Keep in mind that each mapper only ever sees its own input split, so the randomization happens within and between splits rather than across the whole file at once. Also, I think each mapper will create one HDFS file, so you'll have to concatenate all of the files into a single file afterwards.
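Roughly, such a map-only job might look like the following. This is an untested sketch, not something I've run: it assumes each input split is small enough to buffer and shuffle in the mapper's memory, and the ShuffleLines/ShufflingMapper names are just placeholders. It uses the org.apache.hadoop.mapreduce API.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ShuffleLines {

  // Buffers the lines of one input split, then emits them in random order.
  // With zero reduces, each mapper's output goes straight to its own HDFS file.
  public static class ShufflingMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final List<String> lines = new ArrayList<String>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
      lines.add(line.toString());  // assumes the split fits in memory
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      Collections.shuffle(lines);  // randomize this split's lines
      for (String line : lines) {
        context.write(NullWritable.get(), new Text(line));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "shuffle lines");
    job.setJarByClass(ShuffleLines.class);
    job.setMapperClass(ShufflingMapper.class);
    job.setNumReduceTasks(0);  // no reduces: skip the shuffle phase entirely
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

For the concatenation step, hadoop fs -getmerge <outputdir> shuffled.txt should pull the per-mapper part files together into one local file.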
The above isn't a very good way to randomize, but it's fairly easy to implement and should run pretty quickly. Hope this helps.

Alex

On Thu, May 21, 2009 at 7:18 AM, John Clarke <clarke...@gmail.com> wrote:
> Hi,
>
> I have a need to randomize my input file before processing. I understand I
> can chain Hadoop jobs together, so the first could take the input file,
> randomize it, and then the second could take the randomized file and do the
> processing.
>
> The input file has one entry per line and I want to mix up the lines before
> the main processing.
>
> Is there an inbuilt ability I have missed, or will I have to try and write a
> Hadoop program to shuffle my input file?
>
> Cheers,
> John