How to select random n records using mapreduce ?
Hi all,

I'd like to select N random records from a large amount of data using Hadoop, and I wonder how I can achieve this. Currently my idea is to let each mapper task select N / mapper_number records. Does anyone have experience with this?

--
Best Regards
Jeff Zhang
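A minimal sketch of the N / mapper_number idea, simulated in plain Python rather than written against the Hadoop API. Each simulated mapper keeps a fixed-size reservoir (Algorithm R) over its own input split. The names `reservoir_sample`, `sample_per_mapper`, and the per-task seeding are hypothetical, chosen only for illustration. Note the assumption built into this approach: the result is a uniform sample only if every split holds about the same number of records, because records in smaller splits are otherwise more likely to be chosen.

```python
import random


def reservoir_sample(records, k, rng):
    """Return k records chosen uniformly at random from an iterable (Algorithm R)."""
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


def sample_per_mapper(splits, n, seed=0):
    """Simulate each mapper keeping n // len(splits) records from its own split."""
    per_mapper = n // len(splits)
    sample = []
    for mapper_id, split in enumerate(splits):
        # Hypothetical per-task seed so that mappers do not share a random stream.
        rng = random.Random(seed + mapper_id)
        sample.extend(reservoir_sample(split, per_mapper, rng))
    return sample


# Four equally sized "input splits" standing in for the mappers' inputs.
splits = [range(0, 1000), range(1000, 2000), range(2000, 3000), range(3000, 4000)]
picked = sample_per_mapper(splits, n=20)
```

With unequal splits, one option is to give every record a random key, let each mapper keep its N smallest keys, and have a single reducer keep the N smallest overall.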
RE: How to select random n records using mapreduce ?
I did something similar. Basically I had a random sampling algorithm that I called from the mapper. If it returned true I would collect the data, otherwise I would discard it.

Bill

-----Original Message-----
From: ni...@basj.es [mailto:ni...@basj.es] On Behalf Of Niels Basjes
Sent: Monday, June 27, 2011 3:29 PM
To: mapreduce-u...@hadoop.apache.org
Cc: core-u...@hadoop.apache.org
Subject: Re: How to select random n records using mapreduce ?

The only solution I can think of is to create a counter in Hadoop that is incremented each time a mapper lets a record through. As soon as the counter reaches a preselected value, the mappers simply discard any additional input they receive. Note that this is not random at all, yet it's the best I can come up with right now.

HTH

On Mon, Jun 27, 2011 at 09:11, Jeff Zhang zjf...@gmail.com wrote:

Hi all, I'd like to select N random records from a large amount of data using Hadoop, and I wonder how I can achieve this. Currently my idea is to let each mapper task select N / mapper_number records. Does anyone have experience with this?

--
Best Regards
Jeff Zhang

--
Best regards / Met vriendelijke groeten,
Niels Basjes
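The message does not say which sampling algorithm Bill used, so the following is only one plausible reading: a sketch in plain Python (not the Hadoop API) where the mapper keeps each record independently with a fixed probability. The function name `sampling_mapper` and the way the probability is derived from a known total record count are assumptions made for illustration. This yields approximately N records, not exactly N.

```python
import random


def sampling_mapper(records, probability, rng):
    """Yield each record independently with the given probability."""
    for record in records:
        # The sampling test: True means "collect", False means "discard".
        if rng.random() < probability:
            yield record


# Assumption: the total number of input records is known up front.
records = range(100_000)
n_wanted = 1_000
p = n_wanted / len(records)

sample = list(sampling_mapper(records, p, random.Random(42)))
```

If exactly N records are required, the probability can be set somewhat higher than N / total and a single reducer can trim the surplus at random.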
RE: How to select random n records using mapreduce ?
Wait - Habermaas, like in Critical Theory?

-----Original Message-----
From: Habermaas, William [mailto:william.haberm...@fatwire.com]
Sent: Monday, June 27, 2011 2:55 PM
To: common-user@hadoop.apache.org
Subject: RE: How to select random n records using mapreduce ?

I did something similar. Basically I had a random sampling algorithm that I called from the mapper. If it returned true I would collect the data, otherwise I would discard it.

Bill
Re: How to select random n records using mapreduce ?
If the incoming data is unique, you can create a hash of each record and then take the hash modulo some number to select a random set. So if you wanted 10% of the data randomly:

hash % 10 == 0

gives roughly a random 10%.
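A minimal sketch of the hash-modulus test in plain Python. The helper name `in_sample` and the choice of SHA-256 are assumptions; the message does not name a hash function. A digest from `hashlib` is used here instead of Python's built-in `hash()`, because string hashes from `hash()` are salted per process and would select a different subset on each run.

```python
import hashlib


def in_sample(record, modulus=10):
    """Return True for roughly 1 in `modulus` records, based on a stable hash."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    value = int.from_bytes(digest[:8], "big")
    return value % modulus == 0


# Unique records, as the approach assumes.
records = [f"record-{i}" for i in range(10_000)]
sample = [r for r in records if in_sample(r)]
```

The selection is deterministic: the same record is always in or out of the sample, which makes reruns reproducible but also means duplicate records are all kept or all dropped together. Like the probability test, it selects a fraction of the data, not an exact count N.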