The last time I had to do something like this, in the map phase I made the key an effectively random but stable value (the md5 of the real key) and built a new value that had the real key embedded.
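A minimal sketch of that md5-as-key trick in plain Java (no Hadoop here: the sample records are made up, and a TreeMap stands in for the framework's sort-and-group step):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class Md5KeyShuffle {
    // md5 of the real key, hex-encoded: stable, so identical keys still
    // collide into one group, but the groups sort in an effectively
    // random order compared to the real keys.
    static String md5Hex(String key) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String[][] records = { {"apple", "1"}, {"banana", "2"},
                               {"apple", "3"}, {"cherry", "4"} };

        // "Map" phase: emit (md5(realKey), realKey + "\t" + value).
        // The sorted TreeMap plays the role of the shuffle's sort-and-group.
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String[] r : records) {
            grouped.computeIfAbsent(md5Hex(r[0]), k -> new ArrayList<>())
                   .add(r[0] + "\t" + r[1]);
        }

        // "Reduce" phase: records with the same real key still arrive
        // together, but groups come in md5 order, not natural key order.
        for (List<String> group : grouped.values()) {
            String realKey = group.get(0).split("\t")[0];
            System.out.println(realKey + " -> " + group);
        }
    }
}
```

Both "apple" records land in the same group because the md5 key is stable, which is why the reduce still groups correctly.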
Then in the reduce phase I received the records in random order and could do what I wanted. By using a stable but differently sorting value for the key, my reduce still grouped correctly, but I received the calls to reduce in a random order compared to the normal sort order of the data.

On Thu, May 21, 2009 at 12:25 PM, Alex Loddengaard <a...@cloudera.com> wrote:

> Bhupesh,
>
> I forgot to say that the concatenation phase of my plan would concatenate
> randomly. As I mentioned, this wouldn't be a good way to randomize, but
> it'd be pretty easy.
>
> Anyway, your solution is much more clever and does a better job of
> randomizing. Good thinking!
>
> Thanks,
>
> Alex
>
> On Thu, May 21, 2009 at 11:36 AM, Bhupesh Bansal <bban...@linkedin.com> wrote:
>
> > Hmm,
> >
> > IMHO running a mapper-only job will give you an output file with the
> > same order. You should write a custom map-reduce job where map emits
> > (key: Integer.random(), value: line) and the reducer outputs
> > (key: NOTHING, value: line).
> >
> > The reduce phase will sort on Integer.random(), giving you a random
> > ordering for your input file.
> >
> > Best,
> > Bhupesh
> >
> > On 5/21/09 11:15 AM, "Alex Loddengaard" <a...@cloudera.com> wrote:
> >
> > > Hi John,
> > >
> > > I don't know of a built-in way to do this. Depending on how well you
> > > want to randomize, you could just run a MapReduce job with at least
> > > one map (the more maps, the more random) and no reduces. When you run
> > > a job with no reduces, the shuffle phase is skipped entirely, and the
> > > intermediate outputs from the mappers are stored directly to HDFS.
> > > Though I think each mapper will create one HDFS file, so you'll have
> > > to concatenate all the files into a single file.
> > >
> > > The above isn't a very good way to randomize, but it's fairly easy to
> > > implement and should run pretty quickly.
> > >
> > > Hope this helps.
> > >
> > > Alex
> > >
> > > On Thu, May 21, 2009 at 7:18 AM, John Clarke <clarke...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I have a need to randomize my input file before processing. I
> > > > understand I can chain Hadoop jobs together, so the first could
> > > > take the input file and randomize it, and then the second could
> > > > take the randomized file and do the processing.
> > > >
> > > > The input file has one entry per line, and I want to mix up the
> > > > lines before the main processing.
> > > >
> > > > Is there an inbuilt ability I have missed, or will I have to try
> > > > and write a Hadoop program to shuffle my input file?
> > > >
> > > > Cheers,
> > > > John

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals
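Bhupesh's random-key approach from the thread above can also be sketched without Hadoop. In this plain-Java simulation (the class name, seed, and sample lines are all made up), a sorted TreeMap stands in for the reduce-side sort on the random key, and key collisions are handled by keeping a list of lines per key, just as grouping in the reduce would:

```java
import java.util.*;

public class RandomLineShuffle {
    // Simulates the job: map emits (random int, line); the framework
    // sorts by key; reduce drops the key and emits only the line.
    static List<String> shuffle(List<String> lines, long seed) {
        Random rnd = new Random(seed);
        // TreeMap iterates in sorted key order, like the reduce phase.
        TreeMap<Integer, List<String>> byKey = new TreeMap<>();
        for (String line : lines) {
            byKey.computeIfAbsent(rnd.nextInt(), k -> new ArrayList<>())
                 .add(line); // keep duplicates if two lines draw the same key
        }
        List<String> out = new ArrayList<>();
        for (List<String> group : byKey.values()) out.addAll(group);
        return out;
    }

    public static void main(String[] args) {
        List<String> lines =
            Arrays.asList("line1", "line2", "line3", "line4", "line5");
        List<String> shuffled = shuffle(lines, 42L);
        System.out.println(shuffled);

        // Same lines, new order: re-sorting the shuffled copy restores
        // the (already sorted) input.
        List<String> check = new ArrayList<>(shuffled);
        Collections.sort(check);
        System.out.println(check.equals(lines));
    }
}
```

The output is a permutation of the input, which is the whole point: every line survives, only the order changes.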