Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

Pherl Liu Mon, 16 May 2011 07:03:13 -0700

We discussed internally and decided not to make the hashCode()
return deterministic result. If you need consistent hashcode in different
runs, use toByteString().hashCode().


Quoted from Kenton:

Hashing the content of the descriptor would actually be incorrect, because
two descriptors with exactly the same content are still considered different
types.  Descriptors are compared by identity, hence they are hashed by
pointer.

Removing the descriptor from the calculation would indeed make hashCode()
consistent between two runs of the same binary, and probably insignificant
runtime cost.  Of course, once you do that, you will never be able to
introduce non-determinism again because people will depend on it.

But there's a much bigger risk.  People may actually start depending on
hashCode() returning consistent results between two different versions of
the binary, or two completely separate binaries that compile in the same
protocol, or -- most dangerously -- two different versions of the same
protocol (e.g. with fields added or removed).  I think it would be very
difficult and limiting to make these guarantees, so I would be extremely
cautious about this.

Certainly, there is no implementation of hashCode() that would be any safer
than .toByteString().hashCode().  So, I'd advise steering people to the
latter.  Note that if unknown fields are present, the results may still be
inconsistent.  However, there is no reasonable way to implement a hashCode()
that is consistent in the presence of unknown fields.

On Thu, May 12, 2011 at 5:32 AM, Ben Wright <compuware...@gmail.com> wrote:

> I think we wrote those replies at the same time : )
>
> You're right, at the cost of some additional hash collisions, the
> simplest solution is to simply not include the type / descriptor in
> the hash calculation at all.
>
> The best / least-collision solutions with good performance would be
> what I wrote in my previous post, but that requires that someone
> (presumably a current committer) with sufficient knowledge of the
> Descriptor types to have enough time to update the compiler and java
> libraries accordingly.
>
> Any input from a committer for this issue?  Seems the simple solution
> would take less than an hour to push into the stream and could make it
> into the next release.
>
> On May 11, 5:25 pm, Ben Wright <compuware...@gmail.com> wrote:
> > Alternatively... instead of putting the onus on the compiler, the
> > hashcode could be computed by the JVM at initialization time for the
> > Descriptor instance, (which would also help performance of dynamically
> > parsed Descriptor instance hashcode calls).
> >
> > i.e.
> >
> > private final int computedHashcode;
> >
> > public Descriptor() {
> >    //initialization
> >
> >   computedHashcode = do_compute_hashCode();
> >
> > }
> >
> > public int hashCode() {
> >     return computedHashcode;
> >
> > }
> >
> > punlic int do_compute_hashCode(){
> >   return // compute hashcode
> >
> > }
> >
> > This is all talking towards optimum performance implementation... the
> > real problem is the need for a hashCode implementation for Descriptor
> > based on the actual Descriptor's content...
> >
> > On May 11, 4:54 pm, Ben Wright <compuware...@gmail.com> wrote:
> >
> >
> >
> >
> >
> >
> >
> > > Jay:
> >
> > > Using the class name to generate the hashcode is logically incorrect
> > > because the class name can be derived by the options java_package_
> > > name and java_outer_classname.
> >
> > > Additionally (although less likely to matter), separate protocol
> > > buffer files can define an identical class names with different
> > > protocol buffers.
> >
> > > Lastly, and most importantly...
> >
> > > If the same Message is being used with generated code and with dynamic
> > > code, the hash code for the descriptor would still be identical if
> > > generated from the descriptor instance, whereas the dynamic usage does
> > > not have a classname from which to derive a hashcode.  While in your
> > > case this should not matter, it does matter for other users of
> > > protobuf.  The hashcode function would be better served by being
> > > implemented correctly from state data for the descriptor.
> > > Additionally, in generated code it seems that this hashcode could be
> > > pre-computed by the compiler and Descriptor.hashcode() could return a
> > > constant integer - which would be much more efficient than any other
> > > method.
> >
> > > On May 11, 3:02 pm, Jay Booth <jaybo...@gmail.com> wrote:
> >
> > > > It can be legitimate, especially in the case of Object.hashCode(),
> but
> > > > it's supposed to be in sync with equals() by contract.  As it stands,
> > > > two objects which are equal() will produce different hashes, or the
> > > > same logical object will produce different hashes across JVMs.  That
> > > > breaks the contract..  if the equals() method simply did return
> (other
> > > > == this), then it'd be fine, albeit a little useless.
> >
> > > > I created an issue and posted a 1-liner patch that would eliminate
> the
> > > > problem by using getClass().getName().hashCode() to incorporate type
> > > > information into the hashCode without depending on a Descriptor
> > > > object's memory address.
> >
> > > > On May 11, 12:01 am, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> >
> > > > > Hi Jay,
> >
> > > > > I encountered that before. Unfortunately this is a legitimate thing
> to
> > > > > do, as documented in Object.hashCode()
> >
> > > > > I have a write-up of the problem and how we wound up solving it
> (not
> > > > > elegant.. suggestions welcome) here:
> http://squarecog.wordpress.com/2011/02/20/hadoop-requires-stable-hash...
> >
> > > > > D
> >
> > > > > On Mon, May 9, 2011 at 8:25 AM, Jay Booth <jaybo...@gmail.com>
> wrote:
> > > > > > I'm testing an on-disk hashtable with Protobufs and noticed that
> with
> > > > > > the java generated hashcode function, it seems to return a
> different
> > > > > > hashcode across JVM invocations for the same logically equivalent
> > > > > > object (tested with a single string protobuf, same string for
> both
> > > > > > instances).
> >
> > > > > > Is this known behavior?  Bit busy right now backporting this to
> work
> > > > > > with String keys instead but I could provide a bit of command
> line
> > > > > > code that demonstrates the issue when I get a chance.
> >
> > > > > > Glancing at the generated hashcode() function, it looks like the
> > > > > > difference comes from etiher getDescriptorForType().hashCode() or
> > > > > > getUnknownFields().hashCode(), both of which are incorporated.
> >
> > > > > > --
> > > > > > You received this message because you are subscribed to the
> Google Groups "Protocol Buffers" group.
> > > > > > To post to this group, send email to protobuf@googlegroups.com.
> > > > > > To unsubscribe from this group, send email to
> protobuf+unsubscr...@googlegroups.com.
> > > > > > For more options, visit this group athttp://
> groups.google.com/group/protobuf?hl=en.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Protocol Buffers" group.
> To post to this group, send email to protobuf@googlegroups.com.
> To unsubscribe from this group, send email to
> protobuf+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/protobuf?hl=en.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Protocol Buffers" group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.

Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

Reply via email to