On Sat, Sep 15, 2012 at 6:39 AM, Rahul <[email protected]> wrote:
> I agree that it is hard to find use cases for this besides the standard
> types. I can say that in our application, which is on Hadoop MR and not on
> Crunch, we use sorting on strings, but examples for min/max are hard to
> find. If I think of moving my application to Crunch, sorting on a custom
> type is the only thing that we would require. I think I should be able to
> bypass that constraint by making use of PTable<String, CustomObject>.
>
> @Josh, I also think that PType could be the starting point. Maybe we can
> embed this information in the Converter that is used there. As long as the
> user uses the same PType, results would remain in sync.
> But again it comes back to the point Gabriel mentioned: are there use cases
> that go beyond standard types?
>
Maybe something involving composites of numeric types (e.g., Tuple3<Double,
Double, Double> or some such thing), but that's the only non-primitive type
example I could come up with. +1 that max/min of a string type isn't all that
common in my experience. So it feels like this is something that is
nice-to-have-at-some-point but not a critical feature for the next release.

>
> regards
> Rahul
>
>
> On 15-09-2012 13:39, Gabriel Reid wrote:
>
>> Now that I understand the full situation (sorry for not getting it
>> sooner, Rahul), I see that this is indeed a bit of an issue.
>>
>> The min and max methods act the way that would be most logically
>> expected (in my mind), meaning the semantics of the sort are not
>> exactly what would most logically be expected (again, in my mind).
>>
>> For example, if I have a custom class that is Comparable and is being
>> serialized via Reflection with Avro, I would expect that the compareTo
>> method on the class would be used for min/max and sort. Although this
>> is how things work (really well) already for the min/max methods, I
>> think that making sort work with this might be a pain and a lot
>> slower.
>>
>> On the other hand, I'm not at all a user of either the sort or the
>> min/max methods, and it seems most likely to me that they would all
>> be used on numerical types that are built-in and will just work
>> anyhow, so maybe this is a non-issue. Are there any use-cases with
>> these methods on non-numerical data?
>>
>> - Gabriel
>>
>>
>> On Fri, Sep 14, 2012 at 6:22 PM, Josh Wills <[email protected]> wrote:
>>
>>> So it seems reasonable to me for users to expect that min/max/sort give
>>> consistent answers, even though the backing implementations of each are
>>> different.
>>>
>>> I *think* that the right way to do this is to have some sort of notion
>>> of a PType that knows whether the type it refers to is Comparable and
>>> has a method that consistently compares elements of that type, either as
>>> Java types or as serialized (Writable/Avro) types, and then we make the
>>> sort and min/max APIs make use of that functionality. Then the assumption
>>> would be that the min/max/sort APIs would be consistent _assuming_ that
>>> the PCollection had the same associated PType when the min/max/sort
>>> method was called.
>>>
>>> J
>>>
>>> On Fri, Sep 14, 2012 at 6:33 AM, Rahul <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We have min/max/sort APIs in Crunch. The min and max rely on S (the
>>>> user type) being Comparable, while the Sort API relies on the
>>>> corresponding Writable type being comparable, i.e. WritableComparable.
>>>> To me the min and max APIs are special cases of the Sort API and the
>>>> three should be in sync with each other. If this is not the case, then
>>>> at least theoretically we could have cases where sorting produces
>>>> results that are different from the min/max functions. We could adopt
>>>> the Sort approach for all three, but there are some issues in that API:
>>>> if the Writable is not comparable then the error will not be that
>>>> clear, S could have a comparator that is different from the Writable's
>>>> so the results are not as expected by the user, etc. Or maybe we can
>>>> use the Comparable S in the Sort API; I am not sure, but I think we
>>>> would not be able to use the Hadoop shuffle and sort then. I do not
>>>> have a complete idea of how we could make the three in sync. Any
>>>> thoughts on this? But I would like to ask first: should we even try to
>>>> do that? Or am I just cooking up some theory that has no practical use
>>>> case?
>>>> There has been some discussion on this in the CRUNCH-57
>>>> <https://issues.apache.org/jira/browse/CRUNCH-57> issue. Let me know
>>>> what you think.
>>>>
>>>> regards,
>>>> Rahul
>>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
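
To make the concern raised in this thread concrete, here is a minimal sketch in plain Java of how a min based on compareTo and a sort based on serialized bytes can disagree for the same data. It does not use the Crunch or Hadoop APIs. The class `Version`, its `serialize()` method, and the helpers `compareBytes`, `minByCompareTo` and `firstAfterByteSort` are hypothetical names made up for illustration, and the string-based serialization is only an assumed stand-in for whatever a Writable or Avro encoding would produce.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class MinSortConsistency {

    // Hypothetical user type (the "S" in the thread); not part of Crunch.
    static final class Version implements Comparable<Version> {
        final int major;
        final int minor;

        Version(int major, int minor) {
            this.major = major;
            this.minor = minor;
        }

        // Natural ordering: numeric comparison, field by field.
        @Override
        public int compareTo(Version other) {
            int c = Integer.compare(major, other.major);
            return c != 0 ? c : Integer.compare(minor, other.minor);
        }

        // Assumed stand-in for a serialized form; a real Writable or Avro
        // encoding would look different.
        byte[] serialize() {
            return (major + "." + minor).getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public String toString() {
            return major + "." + minor;
        }
    }

    // Unsigned lexicographic comparison of raw bytes, similar in spirit to
    // comparing serialized records without deserializing them.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // "min" semantics: relies on the user type being Comparable.
    static Version minByCompareTo(List<Version> values) {
        return Collections.min(values);
    }

    // "sort" semantics: orders by the serialized bytes, then takes the first.
    static Version firstAfterByteSort(List<Version> values) {
        List<Version> sorted = new ArrayList<>(values);
        sorted.sort((x, y) -> compareBytes(x.serialize(), y.serialize()));
        return sorted.get(0);
    }

    public static void main(String[] args) {
        List<Version> values = Arrays.asList(new Version(1, 9), new Version(1, 10));
        System.out.println("min via compareTo:   " + minByCompareTo(values));
        System.out.println("first via byte sort: " + firstAfterByteSort(values));
    }
}
```

With the two values 1.9 and 1.10, the compareTo-based min returns 1.9, while the byte-ordered sort puts 1.10 first, because the byte for '1' sorts before the byte for '9'. A single comparison defined once per type, as in the PType suggestion above, would remove this kind of mismatch.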
