On Thu, Feb 23, 2012 at 4:38 PM, Ted Dunning <[email protected]> wrote:
> D, > > I think that the current idea on the table is to do essentially nothing > except carefully examine and possibly remove cases of tables or sets with > Vector keys. > Yes, that's what I'm suggesting. If you grep through our code for Map<Vector or Map<DenseVector, or Map<SparseVector, - there are a few cases, and we might want to see if any of these are sources of inefficiency in this sneaky way. -jake > > What specifically are you arguing against? > > On Fri, Feb 24, 2012 at 12:25 AM, Dmitriy Lyubimov <[email protected] > >wrote: > > > Right. Ok. > > > > just in case, here's my updated list of cons ... > > > > 1) if people explicitly happen to write a.equals(b) where they just as > > easily could've written a==b if they were ok with identity comparison, > > it is pretty surely means they expect equals-by-value and not identity > > comparison. > > 2) same goes for HashMap/IdentityHashMap (with much less obviousness > > though). > > > > 3) We probably don't want to prohibit the use of vectors as MR map > > output keys (although i don't immediately see use cases for it), that > > means we have to support comparable-by-value, which would be > > contractually incoherent with equals-by-identity at the same time. > > > > 4) What Ted said about precaching hash codes for Strings I seem to be > > able to easily convert into another cons: apparently best practice > > suggests to go extra step to optimize hash coding rather than to drop > > equals-by-value, hashCode-by-value. There are techniques to improve > > that (e.g. same identity comparison can be had since if a==b then it > > also true that a.equals(b) also true. Hashcoding can be indeed cached > > (with caveat that since vector really is not immutable, then it needs > > to maintain lazy cashing and cache cancellation simiar to > > GregorianCalendar stuff, but that's probably a hassle). > > > > I guess we can look into some hash coding optimization if it seems > > compelling that we do so. well i guess Sean is saying that too. > > > > On Thu, Feb 23, 2012 at 2:31 PM, Sean Owen <[email protected]> wrote: > > > I think equals() and hashCode() ought to be retained at least for > > > consistency. There is no sin in implementing these per se. If > hashCode() > > is > > > too slow in some context then it can be cached if it is so important. > > This > > > is what String does. > > > > > > However I doubt these should be used as keys in general - that is the > > issue > > > to be fixed if anything if there is a performance problem. > > > > > > Do you want people to use equals()? Dunno it's up to the caller really. > > > > > > Sean > > > On Feb 23, 2012 5:25 PM, "Jake Mannix" <[email protected]> wrote: > > > > > >> Hey Devs. > > >> > > >> Was prototyping some stuff in Mahout last night, and noticed > something > > >> I'm not sure if we've talked about before: because we have equals() > for > > >> Vector instances return true iff the numeric values of the vectors are > > >> equal, and we also have a consistent hashCode(), anytime you have > > >> HashMap<Vector, Anything>, all the typical things you think are O(1) > are > > >> really O(vector.numNonZeroes()). I tried to look through the codebase > > and > > >> see where we hang onto maps with vector keys, and we do it sometimes. > > >> Maybe we shouldn't? Most Vectors have identities (clusterId, > > documentId, > > >> topicId, etc...) which we could normalize away... or maybe we should > be > > >> using IdentityHashMap, to ensure you're using strict object identity > and > > >> avoid doing this calculation? This could be really slow if these are > > big > > >> dense vectors, for instance. > > >> > > >> This looks like it could be a really easy place to accidentally add > > heavy > > >> complexity to things. Do we really want people do be checking > > >> *mathematical* equals() on vectors which have floating point > precision? > > >> > > >> -jake > > >> > > >
