Re: How to select random n records using mapreduce ?

2011-06-27 Thread David Rosenstrauch
Building on this, you could do something like the following to make it more random:

    if (numRecordsWritten < NUM_RECORDS_DESIRED) {
        int n = generateARandomNumberBetween1and100();
        if (n == 100) {
            context.write(key, value);
            numRecordsWritten++; // count the emitted record so the cap takes effect
        }
    }

The above would somewhat rando
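The idea in this message (keep each record with a small fixed probability and stop once enough have been written) can be tried outside Hadoop. Below is a minimal sketch in plain Java, assuming the records arrive as an Iterable; the class name CappedBernoulliSampler and its parameters are hypothetical and are not part of the original post or of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class CappedBernoulliSampler {
    /**
     * Keeps each record with probability 1/oddsDenominator and stops
     * as soon as `limit` records have been kept.
     * Hypothetical helper for illustration only.
     */
    public static <T> List<T> sample(Iterable<T> records, int limit,
                                     int oddsDenominator, Random rng) {
        List<T> kept = new ArrayList<>();
        for (T record : records) {
            if (kept.size() >= limit) {
                break; // cap reached: ignore the rest of the input
            }
            // nextInt(d) is uniform over 0..d-1, so this fires with probability 1/d
            if (rng.nextInt(oddsDenominator) == 0) {
                kept.add(record);
            }
        }
        return kept;
    }
}
```

Because sampling stops at the cap, records near the start of the input are more likely to be chosen than records near the end, and a short input may yield fewer than `limit` records.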

Re: How to select random n records using mapreduce ?

2011-06-27 Thread Anthony Urso
On Mon, Jun 27, 2011 at 12:11 AM, Jeff Zhang wrote:
>
> Hi all,
> I'd like to select random N records from a large amount of data using
> hadoop, just wonder how can I achieve this ? Currently my idea is that let
> each mapper task select N / mapper_number records. Does anyone have such
> experien

Re: How to select random n records using mapreduce ?

2011-06-27 Thread Niels Basjes
The only solution I can think of is by creating a counter in Hadoop that is incremented each time a mapper lets a record through. As soon as the value reaches a preselected value the mappers simply discard the additional input they receive. Note that this will not at all be random yet it's the
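The counter-based cutoff described above can be sketched in plain Java. This is an illustration under assumptions: an in-process AtomicLong stands in for the shared counter, and the class name CounterLimitedFilter is hypothetical. Whether a Hadoop counter can be read by running mappers in this way is not something the message establishes, so the sketch shows only the accept-until-limit logic.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CounterLimitedFilter {
    // Stand-in for the shared counter described in the message (assumption).
    private final AtomicLong passed = new AtomicLong();
    private final long limit;

    public CounterLimitedFilter(long limit) {
        this.limit = limit;
    }

    /** Returns true while fewer than `limit` records have been let through. */
    public boolean accept() {
        while (true) {
            long current = passed.get();
            if (current >= limit) {
                return false; // preselected value reached: discard the record
            }
            // Increment only if no other thread changed the counter meanwhile
            if (passed.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    public long passedCount() {
        return passed.get();
    }
}
```

As the message notes, this selects whichever records arrive first, so the result is not random.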

How to select random n records using mapreduce ?

2011-06-27 Thread Jeff Zhang
Hi all, I'd like to select N random records from a large amount of data using Hadoop, and I wonder how I can achieve this. Currently my idea is to let each mapper task select N / mapper_number records. Does anyone have experience with this? -- Best Regards Jeff Zhang
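One possible way to try the per-mapper quota idea is sketched below in plain Java, without any Hadoop dependency. The question does not say how each mapper should pick its share; this sketch swaps in reservoir sampling (Algorithm R) for that step, which is an assumption rather than the poster's method. The class name PerMapperReservoir and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PerMapperReservoir {
    /** Per-mapper share of the total, rounded up: ceil(totalWanted / numMappers). */
    public static int quota(int totalWanted, int numMappers) {
        return (totalWanted + numMappers - 1) / numMappers;
    }

    /**
     * Reservoir sampling (Algorithm R): returns k records chosen uniformly
     * from the input, or the whole input if it has fewer than k records.
     */
    public static <T> List<T> sample(Iterable<T> records, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        for (T record : records) {
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(record); // fill the reservoir first
            } else {
                // j is uniform over 0..seen-1; replace a slot with probability k/seen
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, record);
                }
            }
        }
        return reservoir;
    }
}
```

Each mapper would call sample with k = quota(N, mapper_number) on its own split. Rounding up can yield slightly more than N records in total, so a final trim may be needed, and if the splits differ in size the combined result is not a uniform sample of the whole data set.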