For order-by, Pig samples 100 records from every map task,
using a reservoir sampling algorithm.
I can't think of a way to store data that could adversely affect this sampling.
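To make that concrete, here is a minimal sketch of textbook reservoir sampling (Algorithm R) in Python. It is not Pig's code; the function name `reservoir_sample` and its parameters are made up for illustration, and only the sample size of 100 comes from the description above.

```python
import random

def reservoir_sample(records, k=100, rng=None):
    """Return up to k records chosen uniformly at random from an iterable.

    Hypothetical helper for illustration only; not taken from Pig.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # The first k records fill the reservoir directly.
            reservoir.append(record)
        else:
            # Pick a slot in [0, i]; the record is kept only if the slot
            # falls inside the reservoir, i.e. with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir

if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), k=100)
    print(len(sample))
```

In this scheme every record a task reads ends up in the sample with the same probability, whatever its position in the input, which fits the point that the way the data is stored should not skew the sample.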
This is the class (a Pig load function) that is involved in sampling:
org.apache.pig.impl.builtin.Ran
Hello,
I am curious as to how Pig implements sampling for order by:
http://ofps.oreilly.com/titles/9781449302641/intro_pig_latin.html#order_by
Are there things I could do when storing my data that would adversely
affect this sampling?
Thanks,
Brock