I wonder whether this whole business could be addressed with a better interface. When implementing both hashing and equality, the real question is just "what parts of this object actually matter?" and "how do I compare them?" If equality and hashing were both generically defined in terms of an "important_parts" function, then they couldn't possibly mismatch. Obviously that function needs a better name.
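As a sketch of that idea (the name `important_parts` comes from the message above; the abstract type and everything else here are illustrative assumptions, not an existing Julia interface): types opt in by defining a single function returning the parts that matter, and both `isequal` and `hash` are derived from it, so the two can never disagree.

```julia
# Hypothetical opt-in trait: subtypes define important_parts and get
# consistent isequal/hash for free.
abstract type HasImportantParts end

important_parts(x) = error("define important_parts for $(typeof(x))")

# Both equality and hashing go through important_parts, so a mismatch
# between them is impossible by construction.
Base.isequal(a::T, b::T) where {T<:HasImportantParts} =
    isequal(important_parts(a), important_parts(b))
Base.hash(x::HasImportantParts, h::UInt) = hash(important_parts(x), h)

# The Foo example from the thread, opted in (modern `mutable struct`
# syntax; the 2015-era thread would have written `type Foo`):
mutable struct Foo <: HasImportantParts
    x::Int
end
important_parts(f::Foo) = (f.x,)

a, b = Foo(1), Foo(1)
@assert isequal(a, b) && hash(a) == hash(b)
@assert length(unique([a, b])) == 1  # unique now behaves
```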
On Fri, Jul 17, 2015 at 11:34 AM, Matt Bauman <mbau...@gmail.com> wrote:

> unique() is only broken for custom mutable types that are themselves broken.
> That method of ensuring differentiation is equality. Here's what happened:
>
> * Define mutable type Foo
> * Define equality by Foo's contents
> * But don't define a new hash function, and instead fall back on the
> default implementation (which hashes differently based upon object identity)
>
> Now we're in a situation where two objects that compare as equal have
> different hashes. When we go to insert an object of this type into the keys
> of a dictionary (or a set, as used by unique), here's what happens:
>
> * The dictionary asks for the hash of the key
> * Dictionaries start with 16 "slots" where they can store objects, so the
> dictionary just looks at the last hex digit of the hash to decide which
> slot to use
> * If there's already an object in that slot, the dictionary must determine
> whether it's the object it's looking for or simply a collision (which is
> expected). So it checks whether the existing object is equal to the new one.
> If it's equal, then that's the slot it's looking for. If not, it checks
> the next slot. And so on.
>
> * So now we're in an inconsistent state: most of the time (15 out of 16
> times) objects that claim to be equal get put in different slots because
> their hashes are different. But sometimes they'll happen to end up in the
> same slot, and you get very different behavior (which happened to be what
> you want in this case: the one-element uniqued vector).
>
>
> On Friday, July 17, 2015 at 10:55:31 AM UTC-4, Seth wrote:
>>
>> But doesn't that mean that unique() will stay broken for custom mutable
>> types? That is, if unique() on custom mutable types relies on the fact
>> that the last hash octet is different (and therefore there are no hash
>> collisions), then there has to be SOME method of ensuring
>> differentiation. What am I missing?
>>
>> On Friday, July 17, 2015 at 5:40:42 AM UTC-7, Stefan Karpinski wrote:
>>>
>>> Two reasons:
>>>
>>> 1. That would entail recomputing the hash of any object that's
>>> already in the same table slot (which is not expected to be *that*
>>> rare in a hash table).
>>> 2. The == function (well, isequal, really), if correctly defined,
>>> should always imply that hashes match. If it's possible to get
>>> different hashes when two objects are ==, then either your hash or
>>> your == function is broken.
>>>
>>> If we stored hash values in the hash table, then reason 1 would go away
>>> (at a significant memory cost), so we could warn the user when 2 occurs
>>> and they have a mismatch between their == and hash functions.
>>>
>>> On Fri, Jul 17, 2015 at 3:32 AM, Jeffrey Sarnoff <jeffrey...@gmail.com>
>>> wrote:
>>>
>>>> Just curious: why are the full hashes not compared when the least
>>>> significant octets match (above)?
>>>>
>>>> On Thursday, July 16, 2015 at 12:25:01 PM UTC-4, Matt Bauman wrote:
>>>>
>>>>> On Thursday, July 16, 2015 at 12:19:25 PM UTC-4, milktrader wrote:
>>>>>>
>>>>>> Also, back to the OP question, is the correct solution simply to
>>>>>> define
>>>>>>
>>>>>> Base.hash(f::Foo) = f.x
>>>>>>
>>>>>
>>>>> No, I'd define Base.hash(f::Foo) = hash(f.x, 0x64c74221932dea5b),
>>>>> where I chose the constant by rand(UInt). This way it won't collide
>>>>> with other types. I really need to spend a bit more time with my
>>>>> interfaces chapter and add this info there.
>>>>>
>>>>> On Thursday, July 16, 2015 at 12:07:50 PM UTC-4, Seth wrote:
>>>>>
>>>>>> If I'm reading between the lines correctly, the default/existing
>>>>>> hash function is based on the last byte of the ID? Is there a reason
>>>>>> we don't make the table wider to reduce the chances of collisions
>>>>>> (or would this have bad effects on memory utilization)?
>>>>>>
>>>>>
>>>>> It's not really a hash collision, but rather a hashtable index
>>>>> collision.
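Matt's fix can also be written in the two-argument `hash(x, h::UInt)` form, which composes when a Foo sits inside another hashed container. The seed constant below is the one from the thread; the rest is a minimal sketch, and you would generate your own seed with `rand(UInt)`.

```julia
mutable struct Foo
    x::Int
end
Base.:(==)(a::Foo, b::Foo) = a.x == b.x  # isequal falls back to ==

# Per-type random seed (from the thread, chosen by rand(UInt)) keeps
# Foo(1) from colliding with other types wrapping the same value.
const FOO_SEED = 0x64c74221932dea5b
Base.hash(f::Foo, h::UInt) = hash(f.x, hash(FOO_SEED, h))

# Now == and hash agree, so Dict, Set, and unique all behave:
@assert Foo(1) == Foo(1)
@assert hash(Foo(1)) == hash(Foo(1))
@assert length(unique([Foo(1), Foo(1)])) == 1
```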
>>>>> A new Dict has only 16 slots for its keys, and the slot for an
>>>>> element is simply chosen by the last 4 bits of its hash. If those
>>>>> last 4 bits match those of an object already in the table, there's a
>>>>> collision, and the Dict checks whether the existing element `isequal`
>>>>> to the new one (and doesn't look at the hash again). So the hashes
>>>>> are different, but isequal says the objects are the same. This is why
>>>>> things are wonky.
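The slot arithmetic described above can be sketched directly: with a power-of-two table size, the slot index is just the hash's low bits. This mirrors (but is not) Julia's internal index computation; the function name here is made up for illustration.

```julia
# For a table with a power-of-two number of slots, masking the hash by
# (nslots - 1) keeps its low bits; +1 makes it a 1-based slot index.
slot(h::UInt, nslots::Int) = (h & (nslots - 1)) + 1

# With 16 slots, only the last 4 bits (one hex digit) of the hash matter:
@assert slot(UInt(0x1f), 16) == 16  # low 4 bits are 0xf -> slot 16
@assert slot(UInt(0x20), 16) == 1   # low 4 bits are 0x0 -> slot 1
@assert slot(UInt(0x2f), 16) == slot(UInt(0x3f), 16)  # index collision
```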