How to select random n records using mapreduce ?

2011-06-27 Thread Jeff Zhang
Hi all,

I'd like to select N random records from a large amount of data using
Hadoop, and I wonder how I can achieve this. Currently my idea is to let
each mapper task select N / mapper_number records. Does anyone have
experience with this?
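
A minimal sketch of the per-mapper quota idea, simulated locally in plain Python rather than on Hadoop. It assumes each mapper keeps its quota with reservoir sampling (a technique swapped in here, not named in the thread); `reservoir_sample`, `sample_per_mapper`, and the integer "splits" are hypothetical names and data for illustration only.

```python
import random

def reservoir_sample(records, k, rng):
    """Keep a uniform random sample of up to k records from one pass over records."""
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

def sample_per_mapper(splits, n, seed=0):
    """Hypothetical driver: one 'mapper' per split, each keeping n // len(splits) records."""
    rng = random.Random(seed)
    quota = n // len(splits)
    selected = []
    for split in splits:
        selected.extend(reservoir_sample(split, quota, rng))
    return selected

# Four equally sized input splits standing in for four mapper tasks.
splits = [range(0, 1000), range(1000, 2000), range(2000, 3000), range(3000, 4000)]
sample = sample_per_mapper(splits, n=20)
print(len(sample))
```

Note that equal quotas only give every record the same chance of selection when the splits hold about the same number of records; a record in a small split is more likely to be picked than one in a large split.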


-- 
Best Regards

Jeff Zhang


RE: How to select random n records using mapreduce ?

2011-06-27 Thread Habermaas, William
I did something similar.  Basically I had a random sampling algorithm that I 
called from the mapper. If it returned true I would collect the data, otherwise 
I would discard it. 
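
A rough sketch of what such a mapper-side filter might look like, again simulated in plain Python. The original sampling algorithm is not described in the thread, so this assumes the simplest choice: keep each record independently with a fixed probability. `should_keep`, `mapper`, and `fraction` are hypothetical names.

```python
import random

def should_keep(rng, fraction):
    """Return True with probability `fraction` (assumed sampling rule)."""
    return rng.random() < fraction

def mapper(records, fraction, seed=42):
    """Emit only the records for which the sampling function returns True."""
    rng = random.Random(seed)
    for record in records:
        if should_keep(rng, fraction):
            yield record  # collect the record
        # otherwise the record is discarded

kept = list(mapper(range(10000), fraction=0.05))
print(len(kept))
```

This yields about `fraction` of the input rather than exactly N records, so a follow-up step would be needed to trim the result to N. On a real cluster each mapper would also need its own seed, otherwise every task makes the same sequence of keep/discard decisions.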

Bill 

-----Original Message-----
From: ni...@basj.es [mailto:ni...@basj.es] On Behalf Of Niels Basjes
Sent: Monday, June 27, 2011 3:29 PM
To: mapreduce-u...@hadoop.apache.org
Cc: core-u...@hadoop.apache.org
Subject: Re: How to select random n records using mapreduce ?

The only solution I can think of is to create a counter in Hadoop
that is incremented each time a mapper lets a record through.
As soon as the counter reaches a preselected value, the mappers simply
discard any additional input they receive.

Note that this will not be random at all, yet it's the best I can
come up with right now.
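
A small local simulation of the counter idea, to show why the result is not random. It assumes a single counter that all mappers can read and update, which a running Hadoop job may not actually offer; `PassedCounter` and `counting_mapper` are hypothetical names, not Hadoop APIs.

```python
class PassedCounter:
    """Stand-in for a shared counter (an assumption, not a Hadoop class)."""
    def __init__(self):
        self.value = 0

def counting_mapper(records, counter, limit):
    """Let records through until the shared counter reaches the limit."""
    for record in records:
        if counter.value < limit:
            counter.value += 1
            yield record
        # once the limit is reached, the remaining input is discarded

counter = PassedCounter()
splits = [range(0, 100), range(100, 200)]
selected = []
for split in splits:
    selected.extend(counting_mapper(split, counter, limit=10))

print(selected)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The selection is simply the first records that happen to be processed, so records later in the input never get a chance to be chosen.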

HTH


-- 
Best regards / Met vriendelijke groeten,

Niels Basjes



RE: How to select random n records using mapreduce ?

2011-06-27 Thread Jeff.Schmitz
Wait - Habermaas like in Critical Theory


Re: How to select random n records using mapreduce ?

2011-06-27 Thread Matt Pouttu-Clarke
If the incoming data is unique, you can hash each record and take the
hash modulo some number to select a random subset.  So if you wanted a
random 10% of the data:

hash % 10 == 0

gives roughly a random 10%.
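
A minimal sketch of the hash-and-modulus selection in plain Python. It assumes MD5 from the standard library as the hash function (the thread does not name one) and made-up string records; `selected_by_hash` is a hypothetical helper.

```python
import hashlib

def selected_by_hash(record, modulus=10):
    """Return True when the record's hash is divisible by `modulus`."""
    # hashlib gives the same digest on every run and every machine,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.md5(record.encode("utf-8")).hexdigest()
    return int(digest, 16) % modulus == 0

# Hypothetical unique records standing in for the real input.
records = [f"record-{i}" for i in range(10000)]
sample = [r for r in records if selected_by_hash(r)]
print(len(sample))
```

Because the decision depends only on the record's content, rerunning the job selects the same records, and identical records are either all kept or all dropped, which is why the data needs to be unique. The sample size is close to 10% of the input, not an exact count.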

