Re: min/max/sort APIs not in sync

Josh Wills Fri, 14 Sep 2012 09:23:25 -0700

So it seems reasonable to me for users to expect that min/max/sort give
consistent answers, even though the backing implementations of each are
different.


I *think* that the right way to do this is to have some sort of notion of a
PType that knows whether the type it refers to is Comparable and has a
method that consistently compares elements of that type, either as Java types
or as serialized (Writable/Avro) types, and then we make the sort and
min/max APIs make use of that functionality. Then the assumption would be
that the min/max/sort APIs would be consistent _assuming_ that the
PCollection had the same associated PType when the min/max/sort method was
called.
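
To make that concrete, here is a rough sketch in plain Java. It is not the
Crunch API: OrderedType is a hypothetical stand-in for a PType that carries
its own comparison, and comparing decimal strings is only an assumption used
to imitate comparing a serialized form. The point is that min, max, and sort
agree whenever they are handed the same type object, and can disagree when
they are not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ConsistentOrdering {

    // Hypothetical stand-in for a PType that knows how to order its elements.
    // This is NOT the real Crunch PType interface.
    interface OrderedType<T> {
        Comparator<T> comparator();
    }

    // Orders integers as Java values (the Comparable view).
    static final OrderedType<Integer> JAVA_ORDER =
        () -> Comparator.naturalOrder();

    // Orders integers by their decimal string form. This only imitates a
    // comparison done on a serialized representation.
    static final OrderedType<Integer> SERIALIZED_ORDER =
        () -> (a, b) -> a.toString().compareTo(b.toString());

    // All three operations take their ordering from the same type object.
    static <T> T min(List<T> xs, OrderedType<T> type) {
        return Collections.min(xs, type.comparator());
    }

    static <T> T max(List<T> xs, OrderedType<T> type) {
        return Collections.max(xs, type.comparator());
    }

    static <T> List<T> sort(List<T> xs, OrderedType<T> type) {
        List<T> copy = new ArrayList<>(xs);
        copy.sort(type.comparator());
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(9, 10, 100);

        // Same type for every call: min is the first sorted element and
        // max is the last, under either ordering.
        System.out.println(sort(data, JAVA_ORDER));        // [9, 10, 100]
        System.out.println(sort(data, SERIALIZED_ORDER));  // [10, 100, 9]
        System.out.println(min(data, SERIALIZED_ORDER));   // 10
        System.out.println(max(data, SERIALIZED_ORDER));   // 9

        // Mixed types: min under the Java ordering is 9, but the first
        // element of the serialized-order sort is 10.
        System.out.println(min(data, JAVA_ORDER));         // 9
    }
}
```

The last call is the mismatch Rahul describes: min computed on the Java type
and sort computed on the serialized type give different "smallest" elements.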

J

On Fri, Sep 14, 2012 at 6:33 AM, Rahul <[email protected]> wrote:

> Hi all,
>
> We have min/max/sort APIs in Crunch. The min and max rely on S (the user
> type) being Comparable, while the Sort API relies on the corresponding
> Writable type being comparable, i.e. WritableComparable. To me the min and
> max APIs are special cases of the Sort API, and the three should be in sync
> with each other. If this is not the case, then at least theoretically we
> could have cases where sorting produces results that are different from
> the min/max functions. We could adopt the Sort approach for all three, but
> there are some issues in that API: if the Writable is not comparable, then
> the error will not be that clear; S could have a comparator that is
> different from the Writable's, so the results are not what the user
> expects; etc. Or maybe we can use Comparable S in the Sort API. I am not
> sure, but I think we would not be able to use the Hadoop shuffle and sort
> then. I do not have a complete idea of how we could make the three in
> sync. Any thoughts on this? But I would like to ask first: should we even
> try to do that? Or am I just cooking up some theory that has no practical
> use case? There has been some discussion on this in the CRUNCH-57
> <https://issues.apache.org/jira/browse/CRUNCH-57>
> issue. Let me know what you think.
>
> regards,
> Rahul
>
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
