unique() is only broken for custom mutable types that are themselves 
broken.  The method of ensuring differentiation is equality.  Here's what 
happened:

* Define mutable type Foo
* Define equality by Foo's contents
* But don't define a matching hash method, instead falling back on the 
default implementation (which hashes based on object identity)
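In current syntax, the broken pattern above looks roughly like this (`Foo` 
here is a hypothetical stand-in for the type in question):

```julia
# Equality is defined by contents, but hash is left at its default,
# which for a mutable type is based on object identity.
mutable struct Foo
    x::Int
end

Base.:(==)(a::Foo, b::Foo) = a.x == b.x   # equal by contents...
# ...but no matching Base.hash method is defined.

a, b = Foo(1), Foo(1)
a == b              # true: same contents
hash(a) == hash(b)  # false: distinct objects, identity-based hashes
# unique([a, b]) will therefore usually keep both elements.
```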

Now we're in a situation where two objects that compare as equal have 
different hashes.  When we go to insert an object of this type as a key in 
a dictionary (or a set, as used by unique), here's what happens:

* The dictionary asks for the hash of the key
* Dictionaries start with 16 "slots" where they can store objects, so it 
just looks at the last hex digit of the hash to decide which one to use
* If there's already an object in that slot, the dictionary must determine 
whether it's the object it's looking for or simply a collision (which is 
expected).  So it checks whether the existing object is equal to the new 
one.  If it is, that's the slot it's looking for.  If not, it checks the 
next slot.  And so on.

* So now we're in an inconsistent state: most of the time (15 out of 16 
times) objects that claim to be equal get put in different slots because 
their hashes are different.  But sometimes they'll happen to end up in the 
same slot, and you get this very different behavior (which happened to be 
what you want in this case: the one-element uniqued vector).
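The probing those bullets describe can be sketched as a toy linear-probe 
loop (a simplified model, not Base's actual Dict implementation):

```julia
# Toy open-addressing lookup modeled on the steps above (not Base's real code).
# `slots` is a fixed-size table; `nothing` marks an empty slot.
function probe_slot(slots, key)
    nslots = length(slots)                 # a fresh Dict has 16 slots
    i = Int(hash(key) % UInt(nslots)) + 1  # slot from the low bits of the hash
    while slots[i] !== nothing
        isequal(slots[i], key) && return i # equal occupant: this is key's slot
        i = i % nslots + 1                 # unequal occupant: probe the next slot
    end
    return i                               # empty slot: key isn't in the table
end

slots = Any[nothing for _ in 1:16]
i = probe_slot(slots, "foo")
slots[i] = "foo"
probe_slot(slots, "foo") == i   # true: the same slot is found again
```

With the broken Foo above, two equal keys almost always produce different 
low bits, so they land in different slots and are never compared with 
isequal at all.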


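For completeness, the fix Matt Bauman suggests further down the thread, 
written here in the two-argument hash form (the constant is his; it acts 
as a per-type salt):

```julia
# Fix for the hypothetical Foo above: hash the same contents that == compares,
# folding in the seed `h`.  The constant (picked via rand(UInt)) salts the
# hash so Foo doesn't collide with other types wrapping the same value.
mutable struct Foo
    x::Int
end
Base.:(==)(a::Foo, b::Foo) = a.x == b.x
Base.hash(f::Foo, h::UInt) = hash(f.x, hash(0x64c74221932dea5b, h))

hash(Foo(1)) == hash(Foo(1))       # true: equal objects now hash equally
length(unique([Foo(1), Foo(1)]))   # 1: unique now dedupes reliably
```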
On Friday, July 17, 2015 at 10:55:31 AM UTC-4, Seth wrote:
>
> But doesn't mean that unique() will stay broken for custom mutable types? 
> That is, if unique() on custom mutable types relies on the fact that the 
> last hash octet is different (and therefore there are no hash collisions), 
> then there has to be SOME method of ensuring differentiation. What am I 
> missing?
>
> On Friday, July 17, 2015 at 5:40:42 AM UTC-7, Stefan Karpinski wrote:
>>
>> Two reasons:
>>
>>    1. That would entail recomputing the hash of any object that's 
>>    already in the same table slot (which is not expected to be *that* 
>>    rare in a hash table).
>>    2. The == function (well, isequal, really), if correctly defined, 
>>    should always imply that hashes match. If it's possible to get 
>>    different hashes when two objects are ==, then either your hash or 
>>    your == function is broken.
>>    
>> If we stored hash values in the hash table, then reason 1 would go away 
>> (at a significant memory cost), so we could warn the user when 2 occurs and 
>> they have a mismatch between their == and hash functions.
>>
>> On Fri, Jul 17, 2015 at 3:32 AM, Jeffrey Sarnoff <jeffrey...@gmail.com> 
>> wrote:
>>
>>> just curious, why are the full hashes not compared when least 
>>> significant octets match (above)?
>>>
>>>
>>>
>>> On Thursday, July 16, 2015 at 12:25:01 PM UTC-4, Matt Bauman wrote:
>>>
>>>> On Thursday, July 16, 2015 at 12:19:25 PM UTC-4, milktrader wrote:
>>>>>
>>>>> Also, back to the OP question, is the correct solution to simply 
>>>>> define 
>>>>>
>>>>>  Base.hash(f::Foo) = f.x
>>>>>
>>>>
>>>> No, I'd define Base.hash(f::Foo) = hash(f.x, 0x64c74221932dea5b), 
>>>> where I chose the constant by rand(UInt).  This way it won't collide with 
>>>> other types.  I really need to spend a bit more time with my interfaces 
>>>> chapter and add this info there.
>>>>
>>>> On Thursday, July 16, 2015 at 12:07:50 PM UTC-4, Seth wrote:
>>>>
>>>>> If I'm reading between the lines correctly, the default/existing hash 
>>>>> function is based on the last byte of the ID? Is there a reason we don't 
>>>>> make the table wider to reduce the chances of collisions (or would this 
>>>>> have bad effects on memory utilization)?
>>>>>
>>>>
>>>> It's not really a hash collision, but rather a hashtable index 
>>>> collision.  A new dict only has 16 slots for its keys, and the index 
>>>> for an element is simply chosen by the last 8 bits of its hash.  If 
>>>> those last 8 bits match the hash of an object already in the table, 
>>>> there's a collision, and the Dict checks to see if the existing 
>>>> element `isequal` to the new one (and doesn't look at the hash again). 
>>>> So the hashes are different, but isequal says the objects are the 
>>>> same.  This is why things are wonky.
>>>>
>>>
>>
