On Wed, Dec 14, 2011 at 1:01 PM, Raphael Cendrillon <
[email protected]> wrote:
> Thanks Lance. If I understand you correctly you're proposing the following:
>
> Map: (K1,V1) -> (K2,V2)
> V2 = V1
> K2 = hashcode(K1)
> emit(K2,V2)
>
Preserving K1 may be important. In that case you may prefer

emit(K2, [K1, V1])
>
> Combine: (K2,V2) -> (K3,V3)
> (e.g. if we want to keep 10% of samples)
> if ( K2 % 10 == 0 ) {
>
Why not do this filtering in the mapper?
>
> Also I'm wondering if we can do downsampling at the mapper? Would that be
> more efficient?
>
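Yes, it would be. Records dropped in the mapper are never serialized, sorted,
or shuffled, so the discarded 90% costs almost nothing.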
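
A minimal sketch of mapper-side downsampling, in plain Python rather than
real Hadoop code. The names stable_hash, downsampling_map and run_map are
made up for illustration, and CRC32 is only an assumed stand-in for
hashcode(K1); any deterministic hash would do.

```python
import zlib


def stable_hash(key: str) -> int:
    """Assumed stand-in for hashcode(K1).

    Python's built-in hash() is salted per process for strings, so a
    deterministic checksum is used here instead.
    """
    return zlib.crc32(key.encode("utf-8"))


def downsampling_map(k1, v1, keep_one_in=10):
    """Map (K1, V1) -> (K2, [K1, V1]), dropping most records up front.

    A record is emitted only when K2 % keep_one_in == 0, so roughly
    1/keep_one_in of the input survives. K1 travels in the value so
    it is not lost when the key is replaced by its hash.
    """
    k2 = stable_hash(k1)
    if k2 % keep_one_in == 0:
        yield k2, [k1, v1]


def run_map(records, keep_one_in=10):
    """Hypothetical driver: apply the mapper to every input record."""
    out = []
    for k1, v1 in records:
        out.extend(downsampling_map(k1, v1, keep_one_in))
    return out


# Toy input: 1000 records with string keys.
records = [(f"row-{i}", i * i) for i in range(1000)]
sampled = run_map(records)
```

Because the decision depends only on the hash of K1, the same record is
kept or dropped on every run, and no combiner is needed for the sampling
step.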