This is a great question!

I could be wrong, but I don't believe there is a way to indicate this for a
group-by. It definitely does matter for performance whether your input is
globally sorted. Currently a group-by happens on the reduce side. But if the
input is globally sorted, it can happen on the map side for a significant
performance boost.
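
To illustrate why sorted input helps, here is a rough sketch in plain Python
(not Pig code). It assumes every record for a given key sits contiguously in
the same input split, which is a stronger condition than sorted order alone.
The names group_sorted and split are made up for the example. Under that
assumption each group can be emitted in a single pass, with no shuffle to a
reducer:

```python
from itertools import groupby
from operator import itemgetter


def group_sorted(records):
    """Yield (key, [values]) from (key, value) records already sorted by key.

    Each run of equal keys is consumed in one pass, so only the current
    group is held in memory and no global regrouping step is needed.
    """
    for key, run in groupby(records, key=itemgetter(0)):
        yield key, [value for _, value in run]


# Hypothetical contents of one input split, already sorted by key.
split = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
result = list(group_sorted(split))
```

If the same key could show up in more than one split, the per-split groups
would be partial and would still need a reduce-side merge.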

I did see a CollectableLoadFunc
<http://pig.apache.org/docs/r0.13.0/api/org/apache/pig/CollectableLoadFunc.html>
interface that's used in the MergeJoin algorithm... I don't see why this
couldn't be used for a map-side group-by as well.

On Sun, Oct 12, 2014 at 11:48 PM, Sunil S Nandihalli <
sunil.nandiha...@gmail.com> wrote:

> Hi Everybody,
>  Is there a way to indicate that the data is sorted by the key using which
> the relations are being grouped? Or does it even matter for performance
> whether we indicate it or not?
> Thanks,
> Sunil.
>
