Thanks, that's a clever approach. I'll implement something like this.

Cheers,
John
2009/5/21 Bhupesh Bansal <bban...@linkedin.com>

> Hmm,
>
> IMHO running a mapper-only job will give you an output file
> with the same order. You should write a custom map-reduce job
> where the map emits (key: Integer.random(), value: line)
> and the reducer outputs (key: NOTHING, value: line).
>
> Sorting on the random keys gives you a random ordering for your
> input file.
>
> Best
> Bhupesh
>
>
> On 5/21/09 11:15 AM, "Alex Loddengaard" <a...@cloudera.com> wrote:
>
> > Hi John,
> >
> > I don't know of a built-in way to do this. Depending on how well you want
> > to randomize, you could just run a MapReduce job with at least one map (the
> > more maps, the more random) and no reduces. When you run a job with no
> > reduces, the shuffle phase is skipped entirely, and the intermediate outputs
> > from the mappers are stored directly to HDFS. Though I think each mapper
> > will create one HDFS file, so you'll have to concatenate all the files into a
> > single file.
> >
> > The above isn't a very good way to randomize, but it's fairly easy to
> > implement and should run pretty quickly.
> >
> > Hope this helps.
> >
> > Alex
> >
> > On Thu, May 21, 2009 at 7:18 AM, John Clarke <clarke...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I have a need to randomize my input file before processing. I understand I
> >> can chain Hadoop jobs together, so the first could take the input file and
> >> randomize it, and then the second could take the randomized file and do the
> >> processing.
> >>
> >> The input file has one entry per line, and I want to mix up the lines before
> >> the main processing.
> >>
> >> Is there an inbuilt ability I have missed, or will I have to try and write a
> >> Hadoop program to shuffle my input file?
> >>
> >> Cheers,
> >> John
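[Editor's sketch] Bhupesh's technique can be simulated outside Hadoop as a plain, self-contained Java program: a "map" step pairs each line with a random integer key, a sort by key stands in for the framework's shuffle/sort, and a "reduce" step discards the key and emits only the line. Note that Java has no `Integer.random()` as written in the thread; `Random.nextInt()` plays that role here. This is an illustrative simulation, not the actual Hadoop job code (a real job would subclass `Mapper` and `Reducer`).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class RandomShuffleSketch {
    // "Map" phase: pair each input line with a random integer key.
    // (Java has no Integer.random(); Random.nextInt() is used instead.)
    static List<Map.Entry<Integer, String>> map(List<String> lines, Random rnd) {
        List<Map.Entry<Integer, String>> keyed = new ArrayList<>();
        for (String line : lines) {
            keyed.add(new AbstractMap.SimpleEntry<>(rnd.nextInt(), line));
        }
        return keyed;
    }

    // Shuffle/sort phase: Hadoop sorts intermediate pairs by key before
    // the reduce; sorting on random keys is what randomizes the order.
    static void sortByKey(List<Map.Entry<Integer, String>> keyed) {
        keyed.sort(Map.Entry.comparingByKey());
    }

    // "Reduce" phase: discard the key, emit only the line.
    static List<String> reduce(List<Map.Entry<Integer, String>> keyed) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> e : keyed) {
            out.add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("line1", "line2", "line3", "line4", "line5");
        Random rnd = new Random(42L); // seeded only so the demo is repeatable
        List<Map.Entry<Integer, String>> keyed = map(input, rnd);
        sortByKey(keyed);
        List<String> shuffled = reduce(keyed);
        // Same multiset of lines, in a (very likely) different order.
        System.out.println(shuffled);
    }
}
```

The key point the simulation illustrates: the reducer itself does no sorting; the framework's sort on the intermediate keys is what reorders the lines, so emitting random keys from the map is enough.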