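If there is just one reducer, there is no need for sampling (PIG-2784). But when an order by runs with more than one reducer, you need to sample the data first and determine the partition ranges, so that each reducer receives a contiguous range of keys. Each reducer then sorts its own range, and the reducer outputs taken in order form a globally sorted result. That is what makes a distributed order by possible.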
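To illustrate why the sampling step is needed, here is a minimal sketch in plain Python. It is not Pig's actual implementation. The function names, the sample size, and the quantile-based choice of boundaries are all assumptions made for illustration. The first pass plays the role of the sampling job, the second the role of the sort job.

```python
import bisect
import random


def pick_boundaries(keys, num_reducers, sample_size=100, seed=0):
    """Sample the keys and return num_reducers - 1 split points.

    Hypothetical stand-in for the sampling job: boundaries are evenly
    spaced quantiles of a sorted random sample.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    return [sample[len(sample) * i // num_reducers]
            for i in range(1, num_reducers)]


def partition(key, boundaries):
    """Map a key to a reducer index; larger keys never get a smaller index."""
    return bisect.bisect_right(boundaries, key)


def distributed_order_by(keys, num_reducers):
    """Simulate a range-partitioned sort across num_reducers reducers."""
    if num_reducers == 1 or not keys:
        # A single reducer sees every key, so no sampling is needed.
        return sorted(keys)
    boundaries = pick_boundaries(keys, num_reducers)
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[partition(key, boundaries)].append(key)
    result = []
    for bucket in buckets:
        # Each "reducer" sorts only its own key range.
        result.extend(sorted(bucket))
    return result
```

Because the partition function is monotonic in the key, concatenating the per-reducer outputs in reducer order gives a total order. Without the sampled boundaries, a default hash partitioner would spread keys across reducers arbitrarily, and the combined output would not be sorted.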
Regards,
Rohini

On Thu, May 22, 2014 at 10:37 AM, Ruoyu Liu <[email protected]> wrote:

> Hi all,
>
> I’m looking at the execution process of several operations and have a
> question that may be naive, and I hope that someone can help me.
> For operations like Order by, why do we use an extra MR job to sample
> the data? In the Java implementation, we can always use one MR job
> to implement the operation.
>
> Thank you for your time!!
>
> Best,
> Ruoyu
