Re: Order By Sampling

2011-05-06 Thread Thejas M Nair
The sampling for order-by takes 100 records from every map task, using a reservoir sampling algorithm. I can't think of a way of storing data that would adversely affect this sampling. This is the class (a Pig load function) that is involved in sampling - org.apache.pig.impl.builtin.Ran
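For readers unfamiliar with reservoir sampling, here is a minimal sketch of the textbook version (often called Algorithm R). It is an illustration only, not Pig's actual code: the function name `reservoir_sample` and its parameters are hypothetical, and the default of 100 simply mirrors the per-map-task sample size mentioned above.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int = 100,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream.

    Hypothetical helper for illustration; not part of Pig.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Pick a slot in [0, i]; the record survives only if the
            # slot falls inside the reservoir (probability k / (i + 1)).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Example: sample 100 records from one simulated map task's input.
sample = reservoir_sample(range(10_000), k=100, rng=random.Random(42))
```

In this textbook form, every record ends up in the sample with the same probability (k/n for a stream of n records) no matter where it sits in the input, and the stream is read only once without knowing its length in advance. That property is consistent with the point above that the way data is stored should not skew the sample.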

Order By Sampling

2011-05-04 Thread Brock Noland
Hello, I am curious as to how Pig implements sampling for order by: http://ofps.oreilly.com/titles/9781449302641/intro_pig_latin.html#order_by Are there things I could do when storing my data that would adversely affect this sampling? Thanks, Brock