Hi John,

I don't know of a built-in way to do this. Depending on how well you want to randomize, you could just run a MapReduce job with at least one map (the more maps, the more random) and no reduces. When you run a job with no reduces, the shuffle phase is skipped entirely, and each mapper's output is written directly to HDFS as the job's final output. Keep in mind that each mapper only ever sees its own input split, so the randomization happens within and between splits rather than across the whole file at once. Also, I think each mapper will create one HDFS file, so you'll have to concatenate all of the files into a single file afterwards.
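Roughly, such a map-only job might look like the following. This is an untested sketch, not something I've run: it assumes each input split is small enough to buffer and shuffle in the mapper's memory, and the ShuffleLines/ShufflingMapper names are just placeholders. It uses the org.apache.hadoop.mapreduce API.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ShuffleLines {

  // Buffers the lines of one input split, then emits them in random order.
  // With zero reduces, each mapper's output goes straight to its own HDFS file.
  public static class ShufflingMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final List<String> lines = new ArrayList<String>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
      lines.add(line.toString());  // assumes the split fits in memory
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      Collections.shuffle(lines);  // randomize this split's lines
      for (String line : lines) {
        context.write(NullWritable.get(), new Text(line));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "shuffle lines");
    job.setJarByClass(ShuffleLines.class);
    job.setMapperClass(ShufflingMapper.class);
    job.setNumReduceTasks(0);  // no reduces: skip the shuffle phase entirely
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

For the concatenation step, hadoop fs -getmerge <outputdir> shuffled.txt should pull the per-mapper part files together into one local file.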
The above isn't a very good way to randomize, but it's fairly easy to implement and should run pretty quickly. Hope this helps.

Alex

On Thu, May 21, 2009 at 7:18 AM, John Clarke <clarke...@gmail.com> wrote:
> Hi,
>
> I have a need to randomize my input file before processing. I understand I
> can chain Hadoop jobs together, so the first could take the input file,
> randomize it, and then the second could take the randomized file and do the
> processing.
>
> The input file has one entry per line and I want to mix up the lines before
> the main processing.
>
> Is there an inbuilt ability I have missed, or will I have to try and write a
> Hadoop program to shuffle my input file?
>
> Cheers,
> John