Hmm, IMHO running a mapper-only job will give you an output file in the same order. You should write a custom map-reduce job where the map emits (key: a random integer, e.g. Random.nextInt(), value: line) and the reducer outputs (key: NOTHING, value: line).
The framework's shuffle/sort on those random keys then gives you a random ordering of your input file.

Best,
Bhupesh

On 5/21/09 11:15 AM, "Alex Loddengaard" <a...@cloudera.com> wrote:

> Hi John,
>
> I don't know of a built-in way to do this. Depending on how well you want
> to randomize, you could just run a MapReduce job with at least one map (the
> more maps, the more random) and no reduces. When you run a job with no
> reduces, the shuffle phase is skipped entirely, and the intermediate outputs
> from the mappers are stored directly to HDFS. Though I think each mapper
> will create one HDFS file, so you'll have to concatenate all files into a
> single file.
>
> The above isn't a very good way to randomize, but it's fairly easy to
> implement and should run pretty quickly.
>
> Hope this helps.
>
> Alex
>
> On Thu, May 21, 2009 at 7:18 AM, John Clarke <clarke...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a need to randomize my input file before processing. I understand I
>> can chain Hadoop jobs together so the first could take the input file,
>> randomize it, and then the second could take the randomized file and do the
>> processing.
>>
>> The input file has one entry per line and I want to mix up the lines before
>> the main processing.
>>
>> Is there an inbuilt ability I have missed or will I have to try and write a
>> Hadoop program to shuffle my input file?
>>
>> Cheers,
>> John
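
For what it's worth, the random-key trick above can be sketched in plain Java outside Hadoop (a standalone illustration, not actual Mapper/Reducer code — the class and method names here are made up for the example): pair each line with a random key (what the map would emit), sort by that key (what Hadoop's shuffle/sort does for free), then emit only the lines (what the reducer would do).

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;

public class RandomizeLines {
    // Simulates the proposed job: "map" tags each line with a random int key,
    // the sort stands in for Hadoop's shuffle/sort on keys, and the final
    // projection stands in for a reducer that emits only the value.
    static List<String> randomize(List<String> lines, Random rnd) {
        return lines.stream()
                .map(line -> new AbstractMap.SimpleEntry<>(rnd.nextInt(), line))
                .sorted(Map.Entry.comparingByKey())   // shuffle/sort phase
                .map(Map.Entry::getValue)             // reducer drops the key
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("line1", "line2", "line3", "line4", "line5");
        // Fixed seed only so the demo is repeatable; a real job would not seed.
        System.out.println(randomize(input, new Random(42)));
    }
}
```

In a real Hadoop job the same effect falls out of the framework: because reducers receive their input sorted by key, random keys yield a randomly ordered output, and no custom sorting code is needed.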