Thanks, Lance. If I understand you correctly, you're proposing the following:

Map: (K1,V1) -> (K2,V2)
  V2 = V1
  K2 = hashcode(K1)
  emit(K2,V2)

Combine: (K2,V2) -> (K3,V3)
(e.g. if we want to keep 10% of samples)
  if ( ! K2%10 ) {
    V3 = V2
    K3 = K2
    emit(K3, V3)
  }

Reduce: (K3,V3)
  SequenceFile.Writer.append(K3,V3)

Is that correct?
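
To make sure I'm reading it right, here's roughly what I pictured in Java.
This is just a sketch against the new org.apache.hadoop.mapreduce API; the
class names, the Text key/value types, and the 10% rate are my own, and I'm
letting SequenceFileOutputFormat do the writing rather than calling
SequenceFile.Writer.append directly:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RandomSample {

  // Map: re-key every record on the hashcode of its real key.
  public static class HashKeyMapper extends Mapper<Text, Text, IntWritable, Text> {
    private final IntWritable hash = new IntWritable();

    @Override
    protected void map(Text key, Text value, Context context)
        throws IOException, InterruptedException {
      hash.set(key.hashCode());          // K2 = hashcode(K1)
      context.write(hash, value);        // V2 = V1
    }
  }

  // Combine/Reduce: keep only keys whose hashcode is divisible by 10 (~10%).
  public static class SampleReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      if (key.get() % 10 != 0) {
        return;                          // drop ~90% of the keys
      }
      for (Text value : values) {
        context.write(key, value);       // SequenceFileOutputFormat writes these out
      }
    }
  }
}

With job.setCombinerClass(RandomSample.SampleReducer.class) the dropped keys
never leave the mapper node, which I take to be the point of your suggestion.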

Also, I'm wondering whether we could do the downsampling in the mapper
instead. Would that be more efficient?
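
What I had in mind for the mapper-side version is roughly this (again just a
sketch, same assumptions as above; the rate is hard-coded here where it would
probably come from the job Configuration):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper-side sampling: drop records before they are ever shuffled.
public class SamplingMapper extends Mapper<Text, Text, IntWritable, Text> {
  private static final int KEEP_ONE_IN = 10;   // keep roughly 10% of records
  private final IntWritable hash = new IntWritable();

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    int h = key.hashCode();
    if (h % KEEP_ONE_IN != 0) {
      return;                                  // record never leaves the mapper
    }
    hash.set(h);
    context.write(hash, value);                // only ~10% of records get shuffled
  }
}

That way the dropped records never even reach the combiner or the shuffle,
which is why I suspect it might be cheaper.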

Thanks

On Thu, Dec 8, 2011 at 3:05 PM, Lance Norskog <[email protected]> wrote:

> To get random sampling and sorting:
>
> Generate a hashcode from each of your "real" keys, then map on the hashcode
> instead. This gives a random sort. Make the reducer do a modulo at the
> beginning of the method and return without writing anything. Now, make the
> reducer a combiner also. Now, only your desired subset of samples goes
> across the wire. Each real reducer only gets one live sample, so just save
> it. You now have a randomly sorted and sampled output. Use a Partitioner or
> just one reducer, depending on size.
>
> This is deterministic. To get a different random set each time, munge each
> hashcode with a random number.
