Thanks for answering, Daniel!
There are a couple of things I don't quite get, though. For example, you say
that each mapper will read just a configured number of rows, but wouldn't the
mappers still end up reading the whole file? My other question: if this is
implemented, do you know which classes it lives in?
Thanks in advance.

Renato M.


2011/3/7 Daniel Dai <jiany...@yahoo-inc.com>

> The sampling job in Pig is used by "order by" and "skewed join". It is
> translated to a single map-reduce job. In the map, we sample the data with a
> configurable interval; in the reduce, we do a "group all" followed by a
> nested foreach. Within the foreach, we do a nested sort and then feed the
> result to a UDF ("order by" and "skewed join" use different UDFs).
>
> In PIG-1038, we will optimize nested sort using Hadoop secondary sort where
> possible. The sampling job fits the bill, so PIG-841 is fixed automatically.
>
> Daniel
>
>
> On 03/05/2011 12:54 PM, Renato Marroquín Mogrovejo wrote:
>
>> Hey, does anybody know if PIG-841 was developed? And if it was, how is it
>> being used by Pig?
>> Thanks in advance.
>>
>> Renato M.
>>
>
>
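For readers following the thread, here is a minimal Python sketch of the
sampling flow Daniel describes: interval sampling in the map, then a "group
all" plus nested sort in the reduce, with the sorted sample handed to a UDF.
It is not Pig's actual code. The function names, the interval value, and the
quantile-style stand-in for the UDF are all assumptions made for illustration.

```python
from itertools import islice


def map_sample(records, interval):
    """Map side (illustrative): keep every `interval`-th record of one split."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return list(islice(records, 0, None, interval))


def reduce_sorted_sample(mapper_outputs, key=lambda r: r):
    """Reduce side (illustrative): 'group all' the samples, then sort them."""
    merged = [record for output in mapper_outputs for record in output]
    return sorted(merged, key=key)


def quantile_boundaries(sorted_sample, num_partitions):
    """Hypothetical stand-in for the UDF: pick partition boundary keys."""
    if not sorted_sample or num_partitions < 2:
        return []
    n = len(sorted_sample)
    return [sorted_sample[(i * n) // num_partitions]
            for i in range(1, num_partitions)]


if __name__ == "__main__":
    splits = [[5, 1, 9, 3], [8, 2, 7, 4]]           # two input splits
    samples = [map_sample(s, interval=2) for s in splits]
    ordered = reduce_sorted_sample(samples)
    print(ordered, quantile_boundaries(ordered, num_partitions=2))
```

In this sketch each map call still iterates over its whole split and only
emits a subset; whether Pig's loader can skip data instead is not something
the thread settles.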
