Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

Jason Hsueh Wed, 18 May 2011 12:04:56 -0700

Ah, I didn't get this message. How does this break use in a HashMap? Given
two protobufs of the same type in the same JVM, with the same contents, does
hashCode() not return the same value?


On Wed, May 18, 2011 at 11:50 AM, Jay Booth <jaybo...@gmail.com> wrote:

> I'd also note that this precludes usage of protobuf objects in
> HashSets or as keys in HashMaps -- any time someone does that, it will
> be a (subtle and confusing) bug, as hashcode() isn't aligned with
> equals().
>
> On May 18, 2:48 pm, Jay Booth <jaybo...@gmail.com> wrote:
> > Well, that's your prerogative, I guess, but why even implement
> > hashcode at all then?  Just inherit from object and you're getting
> > effectively the same behavior.  Is that what you're intending?
> >
> > On May 16, 10:03 am, Pherl Liu <liuj...@google.com> wrote:
> >
> >
> >
> >
> >
> >
> >
> > > We discussed internally and decided not to make the hashCode()
> > > return deterministic result. If you need consistent hashcode in
> different
> > > runs, use toByteString().hashCode().
> >
> > > Quoted from Kenton:
> >
> > > Hashing the content of the descriptor would actually be incorrect,
> because
> > > two descriptors with exactly the same content are still considered
> different
> > > types.  Descriptors are compared by identity, hence they are hashed by
> > > pointer.
> >
> > > Removing the descriptor from the calculation would indeed make
> hashCode()
> > > consistent between two runs of the same binary, and probably
> insignificant
> > > runtime cost.  Of course, once you do that, you will never be able to
> > > introduce non-determinism again because people will depend on it.
> >
> > > But there's a much bigger risk.  People may actually start depending on
> > > hashCode() returning consistent results between two different versions
> of
> > > the binary, or two completely separate binaries that compile in the
> same
> > > protocol, or -- most dangerously -- two different versions of the same
> > > protocol (e.g. with fields added or removed).  I think it would be very
> > > difficult and limiting to make these guarantees, so I would be
> extremely
> > > cautious about this.
> >
> > > Certainly, there is no implementation of hashCode() that would be any
> safer
> > > than .toByteString().hashCode().  So, I'd advise steering people to the
> > > latter.  Note that if unknown fields are present, the results may still
> be
> > > inconsistent.  However, there is no reasonable way to implement a
> hashCode()
> > > that is consistent in the presence of unknown fields.
> >
> > > On Thu, May 12, 2011 at 5:32 AM, Ben Wright <compuware...@gmail.com>
> wrote:
> > > > I think we wrote those replies at the same time : )
> >
> > > > You're right, at the cost of some additional hash collisions, the
> > > > simplest solution is to simply not include the type / descriptor in
> > > > the hash calculation at all.
> >
> > > > The best / least-collision solutions with good performance would be
> > > > what I wrote in my previous post, but that requires that someone
> > > > (presumably a current committer) with sufficient knowledge of the
> > > > Descriptor types to have enough time to update the compiler and java
> > > > libraries accordingly.
> >
> > > > Any input from a committer for this issue?  Seems the simple solution
> > > > would take less than an hour to push into the stream and could make
> it
> > > > into the next release.
> >
> > > > On May 11, 5:25 pm, Ben Wright <compuware...@gmail.com> wrote:
> > > > > Alternatively... instead of putting the onus on the compiler, the
> > > > > hashcode could be computed by the JVM at initialization time for
> the
> > > > > Descriptor instance, (which would also help performance of
> dynamically
> > > > > parsed Descriptor instance hashcode calls).
> >
> > > > > i.e.
> >
> > > > > private final int computedHashcode;
> >
> > > > > public Descriptor() {
> > > > >    //initialization
> >
> > > > >   computedHashcode = do_compute_hashCode();
> >
> > > > > }
> >
> > > > > public int hashCode() {
> > > > >     return computedHashcode;
> >
> > > > > }
> >
> > > > > punlic int do_compute_hashCode(){
> > > > >   return // compute hashcode
> >
> > > > > }
> >
> > > > > This is all talking towards optimum performance implementation...
> the
> > > > > real problem is the need for a hashCode implementation for
> Descriptor
> > > > > based on the actual Descriptor's content...
> >
> > > > > On May 11, 4:54 pm, Ben Wright <compuware...@gmail.com> wrote:
> >
> > > > > > Jay:
> >
> > > > > > Using the class name to generate the hashcode is logically
> incorrect
> > > > > > because the class name can be derived by the options
> java_package_
> > > > > > name and java_outer_classname.
> >
> > > > > > Additionally (although less likely to matter), separate protocol
> > > > > > buffer files can define an identical class names with different
> > > > > > protocol buffers.
> >
> > > > > > Lastly, and most importantly...
> >
> > > > > > If the same Message is being used with generated code and with
> dynamic
> > > > > > code, the hash code for the descriptor would still be identical
> if
> > > > > > generated from the descriptor instance, whereas the dynamic usage
> does
> > > > > > not have a classname from which to derive a hashcode.  While in
> your
> > > > > > case this should not matter, it does matter for other users of
> > > > > > protobuf.  The hashcode function would be better served by being
> > > > > > implemented correctly from state data for the descriptor.
> > > > > > Additionally, in generated code it seems that this hashcode could
> be
> > > > > > pre-computed by the compiler and Descriptor.hashcode() could
> return a
> > > > > > constant integer - which would be much more efficient than any
> other
> > > > > > method.
> >
> > > > > > On May 11, 3:02 pm, Jay Booth <jaybo...@gmail.com> wrote:
> >
> > > > > > > It can be legitimate, especially in the case of
> Object.hashCode(),
> > > > but
> > > > > > > it's supposed to be in sync with equals() by contract.  As it
> stands,
> > > > > > > two objects which are equal() will produce different hashes, or
> the
> > > > > > > same logical object will produce different hashes across JVMs.
>  That
> > > > > > > breaks the contract..  if the equals() method simply did return
> > > > (other
> > > > > > > == this), then it'd be fine, albeit a little useless.
> >
> > > > > > > I created an issue and posted a 1-liner patch that would
> eliminate
> > > > the
> > > > > > > problem by using getClass().getName().hashCode() to incorporate
> type
> > > > > > > information into the hashCode without depending on a Descriptor
> > > > > > > object's memory address.
> >
> > > > > > > On May 11, 12:01 am, Dmitriy Ryaboy <dvrya...@gmail.com>
> wrote:
> >
> > > > > > > > Hi Jay,
> >
> > > > > > > > I encountered that before. Unfortunately this is a legitimate
> thing
> > > > to
> > > > > > > > do, as documented in Object.hashCode()
> >
> > > > > > > > I have a write-up of the problem and how we wound up solving
> it
> > > > (not
> > > > > > > > elegant.. suggestions welcome) here:
> > > >
> http://squarecog.wordpress.com/2011/02/20/hadoop-requires-stable-hash...
> >
> > > > > > > > D
> >
> > > > > > > > On Mon, May 9, 2011 at 8:25 AM, Jay Booth <
> jaybo...@gmail.com>
> > > > wrote:
> > > > > > > > > I'm testing an on-disk hashtable with Protobufs and noticed
> that
> > > > with
> > > > > > > > > the java generated hashcode function, it seems to return a
> > > > different
> > > > > > > > > hashcode across JVM invocations for the same logically
> equivalent
> > > > > > > > > object (tested with a single string protobuf, same string
> for
> > > > both
> > > > > > > > > instances).
> >
> > > > > > > > > Is this known behavior?  Bit busy right now backporting
> this to
> > > > work
> > > > > > > > > with String keys instead but I could provide a bit of
> command
> > > > line
> > > > > > > > > code that demonstrates the issue when I get a chance.
> >
> > > > > > > > > Glancing at the generated hashcode() function, it looks
> like the
> > > > > > > > > difference comes from etiher
> getDescriptorForType().hashCode() or
> > > > > > > > > getUnknownFields().hashCode(), both of which are
> incorporated.
> >
> > > > > > > > > --
> > > > > > > > > You received this message because you are subscribed to the
> > > > Google Groups "Protocol Buffers" group.
> > > > > > > > > To post to this group, send email to
> protobuf@googlegroups.com.
> > > > > > > > > To unsubscribe from this group, send email to
> > > > protobuf+unsubscr...@googlegroups.com.
> > > > > > > > > For more options, visit this group athttp://
> > > > groups.google.com/group/protobuf?hl=en.
> >
> > > > --
> > > > You received this message because you are subscribed to the Google
> Groups
> > > > "Protocol Buffers" group.
> > > > To post to this group, send email to protobuf@googlegroups.com.
> > > > To unsubscribe from this group, send email to
> > > > protobuf+unsubscr...@googlegroups.com.
> > > > For more options, visit this group at
> > > >http://groups.google.com/group/protobuf?hl=en.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Protocol Buffers" group.
> To post to this group, send email to protobuf@googlegroups.com.
> To unsubscribe from this group, send email to
> protobuf+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/protobuf?hl=en.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Protocol Buffers" group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.

Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

Reply via email to