Thanks, Lance. If I understand you correctly, you're proposing the following:
Map: (K1, V1) -> (K2, V2)
    V2 = V1
    K2 = hashcode(K1)
    emit(K2, V2)

Combine: (K2, V2) -> (K3, V3)
    // e.g. if we want to keep 10% of samples
    if (K2 % 10 == 0) {
        V3 = V2
        K3 = K2
        emit(K3, V3)
    }

Reduce: (K3, V3)
    SequenceFile.Writer.append(K3, V3)
Is that correct?
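In code I'm picturing something roughly like this (just a sketch against the
new mapreduce API; I'm assuming a SequenceFile of <Text, Text> input, a
hard-coded 10% rate, and the class names are only placeholders):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RandomSample {

  // Map: (K1,V1) -> (hashcode(K1), V1), so the shuffle sort order is random.
  public static class HashMapper
      extends Mapper<Text, Text, IntWritable, Text> {
    private final IntWritable outKey = new IntWritable();

    @Override
    protected void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
      outKey.set(key.hashCode() & Integer.MAX_VALUE); // keep it non-negative
      context.write(outKey, value);
    }
  }

  // Combine: drop everything outside the 10% sample before it crosses the wire.
  public static class SampleCombiner
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> values,
        Context context) throws IOException, InterruptedException {
      if (key.get() % 10 != 0) {
        return; // not part of the sample
      }
      for (Text value : values) {
        context.write(key, value);
      }
    }
  }

  // Reduce: pass the surviving records through; with SequenceFileOutputFormat
  // the framework appends them to a SequenceFile for us.
  // (Hadoop doesn't guarantee the combiner actually runs, so it seems safer
  // to repeat the modulo test here as well.)
  public static class PassThroughReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> values,
        Context context) throws IOException, InterruptedException {
      if (key.get() % 10 != 0) {
        return;
      }
      for (Text value : values) {
        context.write(key, value);
      }
    }
  }
}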
Also, I'm wondering if we could do the downsampling in the mapper instead.
Would that be more efficient?
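For the mapper-side version I mean something like this (again just a sketch,
reusing the types from the snippet above):

  // Mapper-side variant: filter before anything is written to the map output,
  // so the dropped ~90% of records never hits the shuffle at all.
  public static class SamplingMapper
      extends Mapper<Text, Text, IntWritable, Text> {
    private final IntWritable outKey = new IntWritable();

    @Override
    protected void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
      int hash = key.hashCode() & Integer.MAX_VALUE;
      if (hash % 10 != 0) {
        return; // drop this record before the shuffle
      }
      outKey.set(hash);
      context.write(outKey, value);
    }
  }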
Thanks
On Thu, Dec 8, 2011 at 3:05 PM, Lance Norskog <[email protected]> wrote:
> To get random sampling and sorting:
>
> Generate a hashcode from each of your "real" keys, then map on the hashcode
> instead. This gives a random sort. Make the reducer do a modulo test at the
> beginning of the method and return without writing anything when the test
> fails. Now, make the
> reducer a combiner also. Now, only your desired subset of samples goes
> across the wire. Each real reducer only gets one live sample, so just save
> it. You now have a randomly sorted and sampled output. Use a Partitioner or
> just one reducer, depending on size.
>
> This is deterministic. To get a different random set each time, munge each
> hashcode with a random number.