Re: "order by" and "distinct" in one job?

Rohini Palaniswamy Mon, 08 Jun 2015 13:10:46 -0700

If order by and distinct have the same key, it is possible to combine them
into one MapReduce job.  But the current distributed order by uses range
partitioning, and identical keys can go to different reducers. Tacking
distinct onto that would require more work, and it is not something we are
planning to do anytime soon.
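
To illustrate why that matters, here is a rough Python sketch of the problem.
It is a toy model, not Pig's actual partitioner: the function names, the
boundaries, and the round-robin spreading of a hot key are all assumptions
made for illustration. Each "reducer" sorts and de-duplicates only its own
bucket, so the concatenated output is sorted but can still contain
duplicates.

```python
from bisect import bisect_left, bisect_right
from itertools import cycle


def range_partition(records, boundaries):
    """Toy range partitioner (hypothetical, not Pig's implementation).

    A key that equals one or more boundaries is spread round-robin over
    every reducer whose range touches it, standing in for the way a
    skewed key can end up on several reducers.
    """
    buckets = [[] for _ in range(len(boundaries) + 1)]
    rotors = {}
    for rec in records:
        key = rec[0]
        lo = bisect_left(boundaries, key)   # first reducer that may hold key
        hi = bisect_right(boundaries, key)  # last reducer that may hold key
        rotor = rotors.setdefault(key, cycle(range(lo, hi + 1)))
        buckets[next(rotor)].append(rec)
    return buckets


def sort_and_distinct_per_reducer(buckets):
    """Each reducer sorts and de-duplicates only what it received."""
    output = []
    for bucket in buckets:
        output.extend(sorted(set(bucket)))
    return output


# Made-up (block_id, payload) records; block_id 5 is the hot key.
records = [(5, "a"), (3, "x"), (5, "a"), (7, "y"), (5, "a"), (5, "a")]
boundaries = [5, 5]  # the hot key occupies two sampled boundaries

output = sort_and_distinct_per_reducer(range_partition(records, boundaries))
print(output)
```

In this toy run the result is globally sorted by block_id, yet the record
(5, "a") survives once per reducer that received it, so a true distinct
would still need another pass or a different partitioning scheme.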


On Wed, Jun 3, 2015 at 11:14 AM, Mehmet Tepedelenlioglu <
[email protected]> wrote:

> Order and distinct are 2 very different operations. You order by
> something, but you take the distinct over all the fields of a relation,
> which is to say that the key/value structure is quite different for the
> general case.
>
>
> > On Jun 3, 2015, at 11:02 AM, <[email protected]> <[email protected]> wrote:
> >
> > Dear Pig users,
> > Can Pig combine sorting and unique-ing into a single job?  Doing this
> > --define Components, then
> > Sorted_0 = order Components by block_id parallel $par;
> > Sorted = DISTINCT Sorted_0;
> >
> > causes one more MR job to be launched than simply doing this:
> > --define Components, then
> > Sorted = order Components by block_id parallel $par;
> >
> > It would seem there should be some way to do the distinct in the same
> > pass as the sort, like 'sort -u'.  But I can't see how. Any tips would be
> > much appreciated!
> >
> > Thanks,
> > Will
> >
> > William F Dowling
> > Senior Technologist
> > Thomson Reuters
> >
>
>
