So it seems reasonable to me for users to expect that min/max/sort give consistent answers, even though the backing implementations of each are different.
I *think* that the right way to do this is to have some notion of a PType that knows whether the type it refers to is Comparable and has a method that consistently compares elements of that type, either as Java types or as serialized (Writable/Avro) types, and then we make the sort and min/max APIs use that functionality. The assumption would then be that the min/max/sort APIs are consistent _assuming_ that the PCollection had the same associated PType when the min/max/sort method was called.

J

On Fri, Sep 14, 2012 at 6:33 AM, Rahul <[email protected]> wrote:
> Hi all,
>
> We have min/max/sort APIs in Crunch. The min and max rely on S (user type)
> being comparable while the Sort API relies on the corresponding writable
> type being comparable, i.e. WritableComparable. To me the min and max APIs
> are special cases of the Sort API and the three should be in sync with each
> other. If this is not the case then, at least theoretically, we could have
> cases where sorting produces results that are different from the min/max
> functions. We could adopt the Sort approach for all three, but there are
> some issues in that API: if the Writable is not comparable then the
> error will not be that clear, S could have a comparator that is different
> from the Writable's so the results are not as expected by the user, etc. Or
> maybe we can use comparable S in the Sort API, I am not sure, but I think we
> would not be able to use the Hadoop shuffle and sort then. I do not have a
> complete idea how we could make the three in sync. Any thoughts on the same?
> But I would like to ask first: should we even try to do that? Or am I
> just cooking up some theory that has no practical use case? There has been
> some discussion on this in the CRUNCH-57
> <https://issues.apache.org/jira/browse/CRUNCH-57> issue.
> Let me know what you think.
>
> regards,
> Rahul

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
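
To make the PType suggestion above a bit more concrete, here is a minimal sketch in plain Java. It does not use the Crunch API: `OrderedType` is a hypothetical stand-in for a PType that carries its own ordering, and the `min`/`max`/`sort` helpers are illustrative only. The point is just that when all three operations take their comparator from the same type object, they cannot disagree with each other, even if that ordering differs from the element type's natural `Comparable` order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ConsistentOrdering {

    // Hypothetical stand-in for a PType that knows how to order its elements.
    // This interface is an assumption for illustration, not part of Crunch.
    public interface OrderedType<T> {
        Comparator<T> comparator();
    }

    // min, max and sort all ask the same type object for the ordering.
    public static <T> T min(List<T> values, OrderedType<T> type) {
        return Collections.min(values, type.comparator());
    }

    public static <T> T max(List<T> values, OrderedType<T> type) {
        return Collections.max(values, type.comparator());
    }

    public static <T> List<T> sort(List<T> values, OrderedType<T> type) {
        List<T> copy = new ArrayList<>(values); // leave the input untouched
        copy.sort(type.comparator());
        return copy;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("cherry", "Banana", "apple");

        // An ordering that differs from String's natural (case-sensitive) order.
        OrderedType<String> caseInsensitive = () -> String.CASE_INSENSITIVE_ORDER;

        List<String> sorted = sort(values, caseInsensitive);
        System.out.println("sorted = " + sorted);
        System.out.println("min    = " + min(values, caseInsensitive));
        System.out.println("max    = " + max(values, caseInsensitive));

        // Mixing orderings is where the inconsistency comes from:
        // the natural-order minimum is not the first element of the sorted list.
        System.out.println("natural min = " + Collections.min(values));
    }
}
```

The last line of `main` shows the mismatch described in the quoted message: a min computed with one ordering and a sort computed with another need not agree.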
