The quantity of features required of Vector and Matrix is at the edge of
practicality.  I would just ban equals()/hashCode() entirely and have them
throw UnsupportedOperationException.  If you must keep them, I would
maintain a running XOR hash value, updated on each set() operation; see
the sketch below.  Supporting sparsity that way is possible.
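
A minimal sketch of the running-XOR idea (the class and method names are
mine, not Mahout API): each element contributes a per-element code to an
XOR accumulator, so set() can update the hash in O(1), and zero elements
contribute nothing, which is what makes sparsity workable.

  public class XorHashedVector {
    private final double[] values;
    private int runningHash;  // XOR of elementHash(i, values[i]) over all i

    public XorHashedVector(int size) {
      values = new double[size];  // all-zero vector hashes to 0
    }

    private static int elementHash(int index, double value) {
      if (value == 0.0) {
        return 0;  // zeros contribute nothing, so sparse and dense agree
      }
      long bits = Double.doubleToLongBits(value);
      return (31 * index) ^ (int) (bits ^ (bits >>> 32));
    }

    public void set(int index, double value) {
      runningHash ^= elementHash(index, values[index]);  // XOR out old code
      runningHash ^= elementHash(index, value);          // XOR in new code
      values[index] = value;
    }

    @Override
    public int hashCode() {
      return runningHash;  // O(1), always consistent with current contents
    }
  }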

On Thu, Feb 23, 2012 at 5:26 PM, Jake Mannix <[email protected]> wrote:
> On Thu, Feb 23, 2012 at 4:38 PM, Ted Dunning <[email protected]> wrote:
>
>> D,
>>
>> I think that the current idea on the table is to do essentially nothing
>> except carefully examine and possibly remove cases of tables or sets with
>> Vector keys.
>>
>
> Yes, that's what I'm suggesting.  If you grep through our code for
> Map<Vector, Map<DenseVector, or Map<SparseVector, there are a few cases,
> and we might want to see whether any of them are sources of inefficiency
> in this sneaky way.
>
>  -jake
>
>
>>
>> What specifically are you arguing against?
>>
>> On Fri, Feb 24, 2012 at 12:25 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> > Right. Ok.
>> >
>> > Just in case, here's my updated list of cons ...
>> >
>> > 1) If people explicitly write a.equals(b) where they could just as
>> > easily have written a == b if they were OK with identity comparison,
>> > it pretty surely means they expect equality-by-value, not identity
>> > comparison.
>> > 2) The same goes for choosing HashMap over IdentityHashMap (though
>> > this is much less obvious).
>> >
>> > 3) We probably don't want to prohibit the use of vectors as MR map
>> > output keys (although I don't immediately see a use case for it).
>> > That means we have to support comparable-by-value, which would be
>> > contractually incoherent with equals-by-identity.
>> >
>> > 4) What Ted said about precaching hash codes for Strings I can
>> > easily convert into another con: apparently best practice is to go
>> > the extra step of optimizing hash coding rather than dropping
>> > equals-by-value and hashCode-by-value.  There are techniques for
>> > this: e.g., equals() can short-circuit on identity, since if a == b
>> > it follows that a.equals(b) is true.  Hash codes can indeed be cached
>> > (with the caveat that since a vector is not immutable, it needs lazy
>> > caching and cache invalidation similar to the GregorianCalendar
>> > stuff, but that's probably a hassle; a sketch follows below).
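>> >
>> > A minimal sketch of that lazy caching (the class and field names here
>> > are hypothetical, not actual Mahout code):
>> >
>> >   public class CachedHashVector {
>> >     private final double[] values;
>> >     private int cachedHash;
>> >     private boolean hashDirty = true;
>> >
>> >     public CachedHashVector(int size) {
>> >       this.values = new double[size];
>> >     }
>> >
>> >     public void set(int index, double value) {
>> >       values[index] = value;
>> >       hashDirty = true;  // mutation invalidates the cached hash
>> >     }
>> >
>> >     @Override
>> >     public boolean equals(Object o) {
>> >       if (this == o) {
>> >         return true;  // identity short-circuit from point 4
>> >       }
>> >       if (!(o instanceof CachedHashVector)) {
>> >         return false;
>> >       }
>> >       return java.util.Arrays.equals(values,
>> >           ((CachedHashVector) o).values);
>> >     }
>> >
>> >     @Override
>> >     public int hashCode() {
>> >       if (hashDirty) {  // GregorianCalendar-style lazy recompute
>> >         cachedHash = java.util.Arrays.hashCode(values);
>> >         hashDirty = false;
>> >       }
>> >       return cachedHash;
>> >     }
>> >   }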
>> >
>> > I guess we can look into some hash coding optimization if it seems
>> > compelling that we do so. well i guess Sean is saying that too.
>> >
>> > On Thu, Feb 23, 2012 at 2:31 PM, Sean Owen <[email protected]> wrote:
>> > > I think equals() and hashCode() ought to be retained at least for
>> > > consistency. There is no sin in implementing these per se. If
>> > > hashCode() is too slow in some context then it can be cached if it
>> > > is so important. This is what String does.
>> > >
>> > > However, I doubt these should be used as keys in general - that is
>> > > the issue to be fixed, if anything, if there is a performance problem.
>> > >
>> > > Do you want people to use equals()? Dunno, it's up to the caller really.
>> > >
>> > > Sean
>> > > On Feb 23, 2012 5:25 PM, "Jake Mannix" <[email protected]> wrote:
>> > >
>> > >> Hey Devs.
>> > >>
>> > >>  Was prototyping some stuff in Mahout last night, and noticed
>> > >> something I'm not sure we've talked about before: because we have
>> > >> equals() for Vector instances return true iff the numeric values
>> > >> of the vectors are equal, and we also have a consistent
>> > >> hashCode(), anytime you have HashMap<Vector, Anything>, all the
>> > >> typical operations you think are O(1) are really
>> > >> O(vector.numNonZeroes()).  I tried to look through the codebase to
>> > >> see where we hang onto maps with vector keys, and we do it
>> > >> sometimes.  Maybe we shouldn't?  Most Vectors have identities
>> > >> (clusterId, documentId, topicId, etc...) which we could key on
>> > >> instead... or maybe we should be using IdentityHashMap, to ensure
>> > >> strict object identity and avoid this calculation?  This could be
>> > >> really slow if these are big dense vectors, for instance.
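>> > >>
>> > >>  To illustrate (a toy sketch, not code from our tree - the class
>> > >> name is made up):
>> > >>
>> > >>   import java.util.IdentityHashMap;
>> > >>   import java.util.Map;
>> > >>
>> > >>   import org.apache.mahout.math.DenseVector;
>> > >>   import org.apache.mahout.math.Vector;
>> > >>
>> > >>   public class IdentityKeyExample {
>> > >>     public static void main(String[] args) {
>> > >>       Map<Vector, String> byIdentity =
>> > >>           new IdentityHashMap<Vector, String>();
>> > >>       Vector v = new DenseVector(1000000);
>> > >>       byIdentity.put(v, "topic-0");
>> > >>       // get() hashes with System.identityHashCode(v) and compares
>> > >>       // with ==, so the lookup never walks the million cells the
>> > >>       // way value-based hashCode()/equals() in a HashMap would.
>> > >>       System.out.println(byIdentity.get(v));
>> > >>     }
>> > >>   }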
>> > >>
>> > >>  This looks like it could be a really easy place to accidentally
>> > >> add heavy complexity to things.  Do we really want people to be
>> > >> checking *mathematical* equals() on vectors which have floating
>> > >> point precision?
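>> > >>
>> > >>  For instance (a toy sketch; the class name is made up):
>> > >>
>> > >>   import org.apache.mahout.math.DenseVector;
>> > >>   import org.apache.mahout.math.Vector;
>> > >>
>> > >>   public class FloatEqualsExample {
>> > >>     public static void main(String[] args) {
>> > >>       // Mathematically these are "the same" vector, but floating
>> > >>       // point drift makes the stored values differ, so a
>> > >>       // value-based equals() returns false.
>> > >>       Vector a = new DenseVector(new double[] {0.1 + 0.2});
>> > >>       Vector b = new DenseVector(new double[] {0.3});
>> > >>       System.out.println(a.equals(b));  // false: 0.30000000000000004
>> > >>     }
>> > >>   }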
>> > >>
>> > >>  -jake
>> > >>
>> >
>>



-- 
Lance Norskog
[email protected]
