The last time I had to do something like this, in the map phase I made the key an effectively random but stable value (the md5 of the real key) and built a new value that had the real key embedded.
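A minimal sketch of that md5-as-key trick in plain Java (no Hadoop here: the sample records are made up, and a TreeMap stands in for the framework's sort-and-group step):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class Md5KeyShuffle {
    // md5 of the real key, hex-encoded: stable, so identical keys still
    // collide into one group, but the groups sort in an effectively
    // random order compared to the real keys.
    static String md5Hex(String key) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String[][] records = { {"apple", "1"}, {"banana", "2"},
                               {"apple", "3"}, {"cherry", "4"} };

        // "Map" phase: emit (md5(realKey), realKey + "\t" + value).
        // The sorted TreeMap plays the role of the shuffle's sort-and-group.
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String[] r : records) {
            grouped.computeIfAbsent(md5Hex(r[0]), k -> new ArrayList<>())
                   .add(r[0] + "\t" + r[1]);
        }

        // "Reduce" phase: records with the same real key still arrive
        // together, but groups come in md5 order, not natural key order.
        for (List<String> group : grouped.values()) {
            String realKey = group.get(0).split("\t")[0];
            System.out.println(realKey + " -> " + group);
        }
    }
}
```

Both "apple" records land in the same group because the md5 key is stable, which is why the reduce still groups correctly.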
Then in the reduce phase I received the records in random order and could do what I wanted. By using a stable but differently sorting value for the key, my reduce still grouped correctly, but I received the calls to reduce in a random order compared to the normal sort order of the data.

On Thu, May 21, 2009 at 12:25 PM, Alex Loddengaard <a...@cloudera.com> wrote:

> Bhupesh,
>
> I forgot to say that the concatenation phase of my plan would concatenate
> randomly. As I mentioned, this wouldn't be a good way to randomize, but
> it'd be pretty easy.
>
> Anyway, your solution is much more clever and does a better job of
> randomizing. Good thinking!
>
> Thanks,
>
> Alex
>
> On Thu, May 21, 2009 at 11:36 AM, Bhupesh Bansal <bban...@linkedin.com> wrote:
>
> > Hmm,
> >
> > IMHO running a mapper-only job will give you an output file with the
> > same order. You should write a custom map-reduce job where map emits
> > (key: Integer.random(), value: line) and the reducer outputs
> > (key: NOTHING, value: line).
> >
> > The reduce phase will sort on Integer.random(), giving you a random
> > ordering for your input file.
> >
> > Best,
> > Bhupesh
> >
> > On 5/21/09 11:15 AM, "Alex Loddengaard" <a...@cloudera.com> wrote:
> >
> > > Hi John,
> > >
> > > I don't know of a built-in way to do this. Depending on how well you
> > > want to randomize, you could just run a MapReduce job with at least
> > > one map (the more maps, the more random) and no reduces. When you run
> > > a job with no reduces, the shuffle phase is skipped entirely, and the
> > > intermediate outputs from the mappers are stored directly to HDFS.
> > > Though I think each mapper will create one HDFS file, so you'll have
> > > to concatenate all the files into a single file.
> > >
> > > The above isn't a very good way to randomize, but it's fairly easy to
> > > implement and should run pretty quickly.
> > >
> > > Hope this helps.
> > >
> > > Alex
> > >
> > > On Thu, May 21, 2009 at 7:18 AM, John Clarke <clarke...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I have a need to randomize my input file before processing. I
> > > > understand I can chain Hadoop jobs together, so the first could
> > > > take the input file and randomize it, and then the second could
> > > > take the randomized file and do the processing.
> > > >
> > > > The input file has one entry per line, and I want to mix up the
> > > > lines before the main processing.
> > > >
> > > > Is there an inbuilt ability I have missed, or will I have to try
> > > > and write a Hadoop program to shuffle my input file?
> > > >
> > > > Cheers,
> > > > John

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals
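Bhupesh's random-key approach from the thread above can also be sketched without Hadoop. In this plain-Java simulation (the class name, seed, and sample lines are all made up), a sorted TreeMap stands in for the reduce-side sort on the random key, and key collisions are handled by keeping a list of lines per key, just as grouping in the reduce would:

```java
import java.util.*;

public class RandomLineShuffle {
    // Simulates the job: map emits (random int, line); the framework
    // sorts by key; reduce drops the key and emits only the line.
    static List<String> shuffle(List<String> lines, long seed) {
        Random rnd = new Random(seed);
        // TreeMap iterates in sorted key order, like the reduce phase.
        TreeMap<Integer, List<String>> byKey = new TreeMap<>();
        for (String line : lines) {
            byKey.computeIfAbsent(rnd.nextInt(), k -> new ArrayList<>())
                 .add(line); // keep duplicates if two lines draw the same key
        }
        List<String> out = new ArrayList<>();
        for (List<String> group : byKey.values()) out.addAll(group);
        return out;
    }

    public static void main(String[] args) {
        List<String> lines =
            Arrays.asList("line1", "line2", "line3", "line4", "line5");
        List<String> shuffled = shuffle(lines, 42L);
        System.out.println(shuffled);

        // Same lines, new order: re-sorting the shuffled copy restores
        // the (already sorted) input.
        List<String> check = new ArrayList<>(shuffled);
        Collections.sort(check);
        System.out.println(check.equals(lines));
    }
}
```

The output is a permutation of the input, which is the whole point: every line survives, only the order changes.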