[protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-18 Thread Jay Booth
Well, that's your prerogative, I guess, but why even implement
hashcode at all then?  Just inherit from object and you're getting
effectively the same behavior.  Is that what you're intending?

On May 16, 10:03 am, Pherl Liu liuj...@google.com wrote:
 We discussed internally and decided not to make the hashCode()
 return deterministic result. If you need consistent hashcode in different
 runs, use toByteString().hashCode().

 Quoted from Kenton:

 Hashing the content of the descriptor would actually be incorrect, because
 two descriptors with exactly the same content are still considered different
 types.  Descriptors are compared by identity, hence they are hashed by
 pointer.

 Removing the descriptor from the calculation would indeed make hashCode()
 consistent between two runs of the same binary, and probably insignificant
 runtime cost.  Of course, once you do that, you will never be able to
 introduce non-determinism again because people will depend on it.

 But there's a much bigger risk.  People may actually start depending on
 hashCode() returning consistent results between two different versions of
 the binary, or two completely separate binaries that compile in the same
 protocol, or -- most dangerously -- two different versions of the same
 protocol (e.g. with fields added or removed).  I think it would be very
 difficult and limiting to make these guarantees, so I would be extremely
 cautious about this.

 Certainly, there is no implementation of hashCode() that would be any safer
 than .toByteString().hashCode().  So, I'd advise steering people to the
 latter.  Note that if unknown fields are present, the results may still be
 inconsistent.  However, there is no reasonable way to implement a hashCode()
 that is consistent in the presence of unknown fields.







 On Thu, May 12, 2011 at 5:32 AM, Ben Wright compuware...@gmail.com wrote:
  I think we wrote those replies at the same time : )

  You're right, at the cost of some additional hash collisions, the
  simplest solution is to simply not include the type / descriptor in
  the hash calculation at all.

  The best / least-collision solutions with good performance would be
  what I wrote in my previous post, but that requires that someone
  (presumably a current committer) with sufficient knowledge of the
  Descriptor types to have enough time to update the compiler and java
  libraries accordingly.

  Any input from a committer for this issue?  Seems the simple solution
  would take less than an hour to push into the stream and could make it
  into the next release.

  On May 11, 5:25 pm, Ben Wright compuware...@gmail.com wrote:
   Alternatively... instead of putting the onus on the compiler, the
   hashcode could be computed by the JVM at initialization time for the
   Descriptor instance, (which would also help performance of dynamically
   parsed Descriptor instance hashcode calls).

   i.e.

   private final int computedHashcode;

   public Descriptor() {
      //initialization

     computedHashcode = do_compute_hashCode();

   }

   public int hashCode() {
       return computedHashcode;

   }

   punlic int do_compute_hashCode(){
     return // compute hashcode

   }

   This is all talking towards optimum performance implementation... the
   real problem is the need for a hashCode implementation for Descriptor
   based on the actual Descriptor's content...

   On May 11, 4:54 pm, Ben Wright compuware...@gmail.com wrote:

Jay:

Using the class name to generate the hashcode is logically incorrect
because the class name can be derived by the options java_package_
name and java_outer_classname.

Additionally (although less likely to matter), separate protocol
buffer files can define an identical class names with different
protocol buffers.

Lastly, and most importantly...

If the same Message is being used with generated code and with dynamic
code, the hash code for the descriptor would still be identical if
generated from the descriptor instance, whereas the dynamic usage does
not have a classname from which to derive a hashcode.  While in your
case this should not matter, it does matter for other users of
protobuf.  The hashcode function would be better served by being
implemented correctly from state data for the descriptor.
Additionally, in generated code it seems that this hashcode could be
pre-computed by the compiler and Descriptor.hashcode() could return a
constant integer - which would be much more efficient than any other
method.

On May 11, 3:02 pm, Jay Booth jaybo...@gmail.com wrote:

 It can be legitimate, especially in the case of Object.hashCode(),
  but
 it's supposed to be in sync with equals() by contract.  As it stands,
 two objects which are equal() will produce different hashes, or the
 same logical object will produce different hashes across JVMs.  That
 breaks the contract..  if the equals() method simply did 

Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-18 Thread Jason Hsueh
Jumping in late to the thread, and I'm not really a Java person, so I may be
misunderstanding something here. But as far as I can tell, you are asking
for hashCode() to be a 'consistent' hash across processes. hashCode() as
implemented is still useful within a single JVM, allowing you to use
protobufs in HashMaps based on content rather than object identity. That was
the intended use case.

On Wed, May 18, 2011 at 11:48 AM, Jay Booth jaybo...@gmail.com wrote:

 Well, that's your prerogative, I guess, but why even implement
 hashcode at all then?  Just inherit from object and you're getting
 effectively the same behavior.  Is that what you're intending?

 On May 16, 10:03 am, Pherl Liu liuj...@google.com wrote:
  We discussed internally and decided not to make the hashCode()
  return deterministic result. If you need consistent hashcode in different
  runs, use toByteString().hashCode().
 
  Quoted from Kenton:
 
  Hashing the content of the descriptor would actually be incorrect,
 because
  two descriptors with exactly the same content are still considered
 different
  types.  Descriptors are compared by identity, hence they are hashed by
  pointer.
 
  Removing the descriptor from the calculation would indeed make hashCode()
  consistent between two runs of the same binary, and probably
 insignificant
  runtime cost.  Of course, once you do that, you will never be able to
  introduce non-determinism again because people will depend on it.
 
  But there's a much bigger risk.  People may actually start depending on
  hashCode() returning consistent results between two different versions of
  the binary, or two completely separate binaries that compile in the same
  protocol, or -- most dangerously -- two different versions of the same
  protocol (e.g. with fields added or removed).  I think it would be very
  difficult and limiting to make these guarantees, so I would be extremely
  cautious about this.
 
  Certainly, there is no implementation of hashCode() that would be any
 safer
  than .toByteString().hashCode().  So, I'd advise steering people to the
  latter.  Note that if unknown fields are present, the results may still
 be
  inconsistent.  However, there is no reasonable way to implement a
 hashCode()
  that is consistent in the presence of unknown fields.
 
 
 
 
 
 
 
  On Thu, May 12, 2011 at 5:32 AM, Ben Wright compuware...@gmail.com
 wrote:
   I think we wrote those replies at the same time : )
 
   You're right, at the cost of some additional hash collisions, the
   simplest solution is to simply not include the type / descriptor in
   the hash calculation at all.
 
   The best / least-collision solutions with good performance would be
   what I wrote in my previous post, but that requires that someone
   (presumably a current committer) with sufficient knowledge of the
   Descriptor types to have enough time to update the compiler and java
   libraries accordingly.
 
   Any input from a committer for this issue?  Seems the simple solution
   would take less than an hour to push into the stream and could make it
   into the next release.
 
   On May 11, 5:25 pm, Ben Wright compuware...@gmail.com wrote:
Alternatively... instead of putting the onus on the compiler, the
hashcode could be computed by the JVM at initialization time for the
Descriptor instance, (which would also help performance of
 dynamically
parsed Descriptor instance hashcode calls).
 
i.e.
 
private final int computedHashcode;
 
public Descriptor() {
   //initialization
 
  computedHashcode = do_compute_hashCode();
 
}
 
public int hashCode() {
return computedHashcode;
 
}
 
punlic int do_compute_hashCode(){
  return // compute hashcode
 
}
 
This is all talking towards optimum performance implementation... the
real problem is the need for a hashCode implementation for Descriptor
based on the actual Descriptor's content...
 
On May 11, 4:54 pm, Ben Wright compuware...@gmail.com wrote:
 
 Jay:
 
 Using the class name to generate the hashcode is logically
 incorrect
 because the class name can be derived by the options java_package_
 name and java_outer_classname.
 
 Additionally (although less likely to matter), separate protocol
 buffer files can define an identical class names with different
 protocol buffers.
 
 Lastly, and most importantly...
 
 If the same Message is being used with generated code and with
 dynamic
 code, the hash code for the descriptor would still be identical if
 generated from the descriptor instance, whereas the dynamic usage
 does
 not have a classname from which to derive a hashcode.  While in
 your
 case this should not matter, it does matter for other users of
 protobuf.  The hashcode function would be better served by being
 implemented correctly from state data for the descriptor.
 Additionally, in generated code it seems that this hashcode could
 

Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-18 Thread Jason Hsueh
Ah, I didn't get this message. How does this break use in a HashMap? Given
two protobufs of the same type in the same JVM, with the same contents, does
hashCode() not return the same value?

On Wed, May 18, 2011 at 11:50 AM, Jay Booth jaybo...@gmail.com wrote:

 I'd also note that this precludes usage of protobuf objects in
 HashSets or as keys in HashMaps -- any time someone does that, it will
 be a (subtle and confusing) bug, as hashcode() isn't aligned with
 equals().

 On May 18, 2:48 pm, Jay Booth jaybo...@gmail.com wrote:
  Well, that's your prerogative, I guess, but why even implement
  hashcode at all then?  Just inherit from object and you're getting
  effectively the same behavior.  Is that what you're intending?
 
  On May 16, 10:03 am, Pherl Liu liuj...@google.com wrote:
 
 
 
 
 
 
 
   We discussed internally and decided not to make the hashCode()
   return deterministic result. If you need consistent hashcode in
 different
   runs, use toByteString().hashCode().
 
   Quoted from Kenton:
 
   Hashing the content of the descriptor would actually be incorrect,
 because
   two descriptors with exactly the same content are still considered
 different
   types.  Descriptors are compared by identity, hence they are hashed by
   pointer.
 
   Removing the descriptor from the calculation would indeed make
 hashCode()
   consistent between two runs of the same binary, and probably
 insignificant
   runtime cost.  Of course, once you do that, you will never be able to
   introduce non-determinism again because people will depend on it.
 
   But there's a much bigger risk.  People may actually start depending on
   hashCode() returning consistent results between two different versions
 of
   the binary, or two completely separate binaries that compile in the
 same
   protocol, or -- most dangerously -- two different versions of the same
   protocol (e.g. with fields added or removed).  I think it would be very
   difficult and limiting to make these guarantees, so I would be
 extremely
   cautious about this.
 
   Certainly, there is no implementation of hashCode() that would be any
 safer
   than .toByteString().hashCode().  So, I'd advise steering people to the
   latter.  Note that if unknown fields are present, the results may still
 be
   inconsistent.  However, there is no reasonable way to implement a
 hashCode()
   that is consistent in the presence of unknown fields.
 
   On Thu, May 12, 2011 at 5:32 AM, Ben Wright compuware...@gmail.com
 wrote:
I think we wrote those replies at the same time : )
 
You're right, at the cost of some additional hash collisions, the
simplest solution is to simply not include the type / descriptor in
the hash calculation at all.
 
The best / least-collision solutions with good performance would be
what I wrote in my previous post, but that requires that someone
(presumably a current committer) with sufficient knowledge of the
Descriptor types to have enough time to update the compiler and java
libraries accordingly.
 
Any input from a committer for this issue?  Seems the simple solution
would take less than an hour to push into the stream and could make
 it
into the next release.
 
On May 11, 5:25 pm, Ben Wright compuware...@gmail.com wrote:
 Alternatively... instead of putting the onus on the compiler, the
 hashcode could be computed by the JVM at initialization time for
 the
 Descriptor instance, (which would also help performance of
 dynamically
 parsed Descriptor instance hashcode calls).
 
 i.e.
 
 private final int computedHashcode;
 
 public Descriptor() {
//initialization
 
   computedHashcode = do_compute_hashCode();
 
 }
 
 public int hashCode() {
 return computedHashcode;
 
 }
 
 punlic int do_compute_hashCode(){
   return // compute hashcode
 
 }
 
 This is all talking towards optimum performance implementation...
 the
 real problem is the need for a hashCode implementation for
 Descriptor
 based on the actual Descriptor's content...
 
 On May 11, 4:54 pm, Ben Wright compuware...@gmail.com wrote:
 
  Jay:
 
  Using the class name to generate the hashcode is logically
 incorrect
  because the class name can be derived by the options
 java_package_
  name and java_outer_classname.
 
  Additionally (although less likely to matter), separate protocol
  buffer files can define an identical class names with different
  protocol buffers.
 
  Lastly, and most importantly...
 
  If the same Message is being used with generated code and with
 dynamic
  code, the hash code for the descriptor would still be identical
 if
  generated from the descriptor instance, whereas the dynamic usage
 does
  not have a classname from which to derive a hashcode.  While in
 your
  case this should not matter, it does matter for other users of
  protobuf.  The hashcode function would be 

Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-18 Thread Ron Reynolds
the corner-stone of Hash* containers is:
 (A.equals(B)) = (A.hashCode() == B.hashCode()) for all A, B.  

tho it's not explicitly stated it would seem to be implied that is within a 
single JVM.
not sure if the code in question maintains that rule within a JVM (if not 
that's 
a big deal).  
if so that would seem sufficient for all but the most distributed of Hash* 
containers, such as where a client which is remote from the storage (i.e, in 
another JVM) determines the bucket (based on hashCode()) to find that the 
element in question has been placed into another bucket because the hashCode() 
within the containing JVM has evaluated to another value.  that's a pretty 
far-fetched, but not unimaginable, situation.




From: Jason Hsueh jas...@google.com
To: Jay Booth jaybo...@gmail.com
Cc: Protocol Buffers protobuf@googlegroups.com
Sent: Wed, May 18, 2011 12:00:19 PM
Subject: Re: [protobuf] Re: Generated hashcode() returns different values 
across 
JVM instances?

Jumping in late to the thread, and I'm not really a Java person, so I may be 
misunderstanding something here. But as far as I can tell, you are asking for 
hashCode() to be a 'consistent' hash across processes. hashCode() as 
implemented 
is still useful within a single JVM, allowing you to use protobufs in HashMaps 
based on content rather than object identity. That was the intended use case.


On Wed, May 18, 2011 at 11:48 AM, Jay Booth jaybo...@gmail.com wrote:

Well, that's your prerogative, I guess, but why even implement
hashcode at all then?  Just inherit from object and you're getting
effectively the same behavior.  Is that what you're intending?


On May 16, 10:03 am, Pherl Liu liuj...@google.com wrote:
 We discussed internally and decided not to make the hashCode()
 return deterministic result. If you need consistent hashcode in different
 runs, use toByteString().hashCode().

 Quoted from Kenton:

 Hashing the content of the descriptor would actually be incorrect, because
 two descriptors with exactly the same content are still considered different
 types.  Descriptors are compared by identity, hence they are hashed by
 pointer.

 Removing the descriptor from the calculation would indeed make hashCode()
 consistent between two runs of the same binary, and probably insignificant
 runtime cost.  Of course, once you do that, you will never be able to
 introduce non-determinism again because people will depend on it.

 But there's a much bigger risk.  People may actually start depending on
 hashCode() returning consistent results between two different versions of
 the binary, or two completely separate binaries that compile in the same
 protocol, or -- most dangerously -- two different versions of the same
 protocol (e.g. with fields added or removed).  I think it would be very
 difficult and limiting to make these guarantees, so I would be extremely
 cautious about this.

 Certainly, there is no implementation of hashCode() that would be any safer
 than .toByteString().hashCode().  So, I'd advise steering people to the
 latter.  Note that if unknown fields are present, the results may still be
 inconsistent.  However, there is no reasonable way to implement a hashCode()
 that is consistent in the presence of unknown fields.








 On Thu, May 12, 2011 at 5:32 AM, Ben Wright compuware...@gmail.com wrote:
  I think we wrote those replies at the same time : )

  You're right, at the cost of some additional hash collisions, the
  simplest solution is to simply not include the type / descriptor in
  the hash calculation at all.

  The best / least-collision solutions with good performance would be
  what I wrote in my previous post, but that requires that someone
  (presumably a current committer) with sufficient knowledge of the
  Descriptor types to have enough time to update the compiler and java
  libraries accordingly.

  Any input from a committer for this issue?  Seems the simple solution
  would take less than an hour to push into the stream and could make it
  into the next release.

  On May 11, 5:25 pm, Ben Wright compuware...@gmail.com wrote:
   Alternatively... instead of putting the onus on the compiler, the
   hashcode could be computed by the JVM at initialization time for the
   Descriptor instance, (which would also help performance of dynamically
   parsed Descriptor instance hashcode calls).

   i.e.

   private final int computedHashcode;

   public Descriptor() {
  //initialization

 computedHashcode = do_compute_hashCode();

   }

   public int hashCode() {
   return computedHashcode;

   }

   punlic int do_compute_hashCode(){
 return // compute hashcode

   }

   This is all talking towards optimum performance implementation... the
   real problem is the need for a hashCode implementation for Descriptor
   based on the actual Descriptor's content...

   On May 11, 4:54 pm, Ben Wright compuware...@gmail.com wrote:

Jay:

Using the class name to generate

Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-18 Thread Jason Hsueh
On Wed, May 18, 2011 at 12:08 PM, Ron Reynolds tequila...@ymail.com wrote:

 the corner-stone of Hash* containers is:
  (A.equals(B)) = (A.hashCode() == B.hashCode()) for all A, B.


From what Ive gathered, this invariant is true for protocol buffer's
implementation of equals() and hashCode(). The descriptor object identity is
included in both methods.

-- 
You received this message because you are subscribed to the Google Groups 
Protocol Buffers group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.



[protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-18 Thread Jay Booth
Right, I forgot that the Descriptor object is statically initialized
so it'll be consistent within the JVM.

I still think that in the context of serializable objects, (i.e.
objects intended to be transported between JVMs in some way, shape or
form) a consistent hashCode would be useful for a lot of cases (mine
included).  If it can't be consistent in the presence of unknown
fields, perhaps a well-documented caution to that (relatively
uncommon?) case would be more useful than completely punting on all
cases.

On May 18, 3:08 pm, Ron Reynolds tequila...@ymail.com wrote:
 the corner-stone of Hash* containers is:
  (A.equals(B)) = (A.hashCode() == B.hashCode()) for all A, B.  

 tho it's not explicitly stated it would seem to be implied that is within a
 single JVM.
 not sure if the code in question maintains that rule within a JVM (if not 
 that's
 a big deal).  
 if so that would seem sufficient for all but the most distributed of Hash*
 containers, such as where a client which is remote from the storage (i.e, in
 another JVM) determines the bucket (based on hashCode()) to find that the
 element in question has been placed into another bucket because the hashCode()
 within the containing JVM has evaluated to another value.  that's a pretty
 far-fetched, but not unimaginable, situation.

 
 From: Jason Hsueh jas...@google.com
 To: Jay Booth jaybo...@gmail.com
 Cc: Protocol Buffers protobuf@googlegroups.com
 Sent: Wed, May 18, 2011 12:00:19 PM
 Subject: Re: [protobuf] Re: Generated hashcode() returns different values 
 across
 JVM instances?

 Jumping in late to the thread, and I'm not really a Java person, so I may be
 misunderstanding something here. But as far as I can tell, you are asking for
 hashCode() to be a 'consistent' hash across processes. hashCode() as 
 implemented
 is still useful within a single JVM, allowing you to use protobufs in HashMaps
 based on content rather than object identity. That was the intended use case.

 On Wed, May 18, 2011 at 11:48 AM, Jay Booth jaybo...@gmail.com wrote:

 Well, that's your prerogative, I guess, but why even implement









 hashcode at all then?  Just inherit from object and you're getting
 effectively the same behavior.  Is that what you're intending?

 On May 16, 10:03 am, Pherl Liu liuj...@google.com wrote:
  We discussed internally and decided not to make the hashCode()
  return deterministic result. If you need consistent hashcode in different
  runs, use toByteString().hashCode().

  Quoted from Kenton:

  Hashing the content of the descriptor would actually be incorrect, because
  two descriptors with exactly the same content are still considered 
  different
  types.  Descriptors are compared by identity, hence they are hashed by
  pointer.

  Removing the descriptor from the calculation would indeed make hashCode()
  consistent between two runs of the same binary, and probably insignificant
  runtime cost.  Of course, once you do that, you will never be able to
  introduce non-determinism again because people will depend on it.

  But there's a much bigger risk.  People may actually start depending on
  hashCode() returning consistent results between two different versions of
  the binary, or two completely separate binaries that compile in the same
  protocol, or -- most dangerously -- two different versions of the same
  protocol (e.g. with fields added or removed).  I think it would be very
  difficult and limiting to make these guarantees, so I would be extremely
  cautious about this.

  Certainly, there is no implementation of hashCode() that would be any safer
  than .toByteString().hashCode().  So, I'd advise steering people to the
  latter.  Note that if unknown fields are present, the results may still be
  inconsistent.  However, there is no reasonable way to implement a 
  hashCode()
  that is consistent in the presence of unknown fields.

  On Thu, May 12, 2011 at 5:32 AM, Ben Wright compuware...@gmail.com wrote:
   I think we wrote those replies at the same time : )

   You're right, at the cost of some additional hash collisions, the
   simplest solution is to simply not include the type / descriptor in
   the hash calculation at all.

   The best / least-collision solutions with good performance would be
   what I wrote in my previous post, but that requires that someone
   (presumably a current committer) with sufficient knowledge of the
   Descriptor types to have enough time to update the compiler and java
   libraries accordingly.

   Any input from a committer for this issue?  Seems the simple solution
   would take less than an hour to push into the stream and could make it
   into the next release.

   On May 11, 5:25 pm, Ben Wright compuware...@gmail.com wrote:
Alternatively... instead of putting the onus on the compiler, the
hashcode could be computed by the JVM at initialization time for the
Descriptor instance, (which would also help performance of dynamically
parsed

Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-18 Thread Jason Hsueh
Well, hashing the serialization as Pherl/Kenton suggested gives you that,
and seems cleaner and more direct than making hashCode() of the in-memory
object assume that you are doing something across JVMs.

(Also, unknown fields are pretty common when you may have a jungle of
binaries, which is true here.)

On Wed, May 18, 2011 at 12:19 PM, Jay Booth jaybo...@gmail.com wrote:

 Right, I forgot that the Descriptor object is statically initialized
 so it'll be consistent within the JVM.

 I still think that in the context of serializable objects, (i.e.
 objects intended to be transported between JVMs in some way, shape or
 form) a consistent hashCode would be useful for a lot of cases (mine
 included).  If it can't be consistent in the presence of unknown
 fields, perhaps a well-documented caution to that (relatively
 uncommon?) case would be more useful than completely punting on all
 cases.

 On May 18, 3:08 pm, Ron Reynolds tequila...@ymail.com wrote:
  the corner-stone of Hash* containers is:
   (A.equals(B)) = (A.hashCode() == B.hashCode()) for all A, B.
 
  tho it's not explicitly stated it would seem to be implied that is within
 a
  single JVM.
  not sure if the code in question maintains that rule within a JVM (if not
 that's
  a big deal).
  if so that would seem sufficient for all but the most distributed of
 Hash*
  containers, such as where a client which is remote from the storage (i.e,
 in
  another JVM) determines the bucket (based on hashCode()) to find that the
  element in question has been placed into another bucket because the
 hashCode()
  within the containing JVM has evaluated to another value.  that's a
 pretty
  far-fetched, but not unimaginable, situation.
 
  
  From: Jason Hsueh jas...@google.com
  To: Jay Booth jaybo...@gmail.com
  Cc: Protocol Buffers protobuf@googlegroups.com
  Sent: Wed, May 18, 2011 12:00:19 PM
  Subject: Re: [protobuf] Re: Generated hashcode() returns different values
 across
  JVM instances?
 
  Jumping in late to the thread, and I'm not really a Java person, so I may
 be
  misunderstanding something here. But as far as I can tell, you are asking
 for
  hashCode() to be a 'consistent' hash across processes. hashCode() as
 implemented
  is still useful within a single JVM, allowing you to use protobufs in
 HashMaps
  based on content rather than object identity. That was the intended use
 case.
 
  On Wed, May 18, 2011 at 11:48 AM, Jay Booth jaybo...@gmail.com wrote:
 
  Well, that's your prerogative, I guess, but why even implement
 
 
 
 
 
 
 
 
 
  hashcode at all then?  Just inherit from object and you're getting
  effectively the same behavior.  Is that what you're intending?
 
  On May 16, 10:03 am, Pherl Liu liuj...@google.com wrote:
   We discussed internally and decided not to make the hashCode()
   return deterministic result. If you need consistent hashcode in
 different
   runs, use toByteString().hashCode().
 
   Quoted from Kenton:
 
   Hashing the content of the descriptor would actually be incorrect,
 because
   two descriptors with exactly the same content are still considered
 different
   types.  Descriptors are compared by identity, hence they are hashed by
   pointer.
 
   Removing the descriptor from the calculation would indeed make
 hashCode()
   consistent between two runs of the same binary, and probably
 insignificant
   runtime cost.  Of course, once you do that, you will never be able to
   introduce non-determinism again because people will depend on it.
 
   But there's a much bigger risk.  People may actually start depending
 on
   hashCode() returning consistent results between two different versions
 of
   the binary, or two completely separate binaries that compile in the
 same
   protocol, or -- most dangerously -- two different versions of the same
   protocol (e.g. with fields added or removed).  I think it would be
 very
   difficult and limiting to make these guarantees, so I would be
 extremely
   cautious about this.
 
   Certainly, there is no implementation of hashCode() that would be any
 safer
   than .toByteString().hashCode().  So, I'd advise steering people to
 the
   latter.  Note that if unknown fields are present, the results may
 still be
   inconsistent.  However, there is no reasonable way to implement a
 hashCode()
   that is consistent in the presence of unknown fields.
 
   On Thu, May 12, 2011 at 5:32 AM, Ben Wright compuware...@gmail.com
 wrote:
I think we wrote those replies at the same time : )
 
You're right, at the cost of some additional hash collisions, the
simplest solution is to simply not include the type / descriptor in
the hash calculation at all.
 
The best / least-collision solutions with good performance would be
what I wrote in my previous post, but that requires that someone
(presumably a current committer) with sufficient knowledge of the
Descriptor types to have enough time to update the compiler and java

Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-18 Thread Jay Booth
That definitely seems acceptable.  Any reason not to make the default
implementation return toByteString().hashCode() and get the best of
both worlds?  It's a little more expensive probably, but
Object.hashCode() is already deceptively expensive to begin with.

I personally switched my particular problem to use Strings as keys,
since that works for the use case here and is easier for corresponding
command-line tools, I just generally see a consistent hashcode as a
good idea if supportable.

On Wed, May 18, 2011 at 3:33 PM, Jason Hsueh jas...@google.com wrote:
 Well, hashing the serialization as Pherl/Kenton suggested gives you that,
 and seems cleaner and more direct than making hashCode() of the in-memory
 object assume that you are doing something across JVMs.
 (Also, unknown fields are pretty common when you may have a jungle of
 binaries, which is true here.)

 On Wed, May 18, 2011 at 12:19 PM, Jay Booth jaybo...@gmail.com wrote:

 Right, I forgot that the Descriptor object is statically initialized
 so it'll be consistent within the JVM.

 I still think that in the context of serializable objects, (i.e.
 objects intended to be transported between JVMs in some way, shape or
 form) a consistent hashCode would be useful for a lot of cases (mine
 included).  If it can't be consistent in the presence of unknown
 fields, perhaps a well-documented caution to that (relatively
 uncommon?) case would be more useful than completely punting on all
 cases.

 On May 18, 3:08 pm, Ron Reynolds tequila...@ymail.com wrote:
  the corner-stone of Hash* containers is:
   (A.equals(B)) = (A.hashCode() == B.hashCode()) for all A, B.
 
  tho it's not explicitly stated it would seem to be implied that is
  within a
  single JVM.
  not sure if the code in question maintains that rule within a JVM (if
  not that's
  a big deal).
  if so that would seem sufficient for all but the most distributed of
  Hash*
  containers, such as where a client which is remote from the storage
  (i.e, in
  another JVM) determines the bucket (based on hashCode()) to find that
  the
  element in question has been placed into another bucket because the
  hashCode()
  within the containing JVM has evaluated to another value.  that's a
  pretty
  far-fetched, but not unimaginable, situation.
 
  
  From: Jason Hsueh jas...@google.com
  To: Jay Booth jaybo...@gmail.com
  Cc: Protocol Buffers protobuf@googlegroups.com
  Sent: Wed, May 18, 2011 12:00:19 PM
  Subject: Re: [protobuf] Re: Generated hashcode() returns different
  values across
  JVM instances?
 
  Jumping in late to the thread, and I'm not really a Java person, so I
  may be
  misunderstanding something here. But as far as I can tell, you are
  asking for
  hashCode() to be a 'consistent' hash across processes. hashCode() as
  implemented
  is still useful within a single JVM, allowing you to use protobufs in
  HashMaps
  based on content rather than object identity. That was the intended use
  case.
 
  On Wed, May 18, 2011 at 11:48 AM, Jay Booth jaybo...@gmail.com wrote:
 
  Well, that's your prerogative, I guess, but why even implement
 
 
 
 
 
 
 
 
 
  hashcode at all then?  Just inherit from object and you're getting
  effectively the same behavior.  Is that what you're intending?
 
  On May 16, 10:03 am, Pherl Liu liuj...@google.com wrote:
   We discussed internally and decided not to make the hashCode()
   return deterministic result. If you need consistent hashcode in
   different
   runs, use toByteString().hashCode().
 
   Quoted from Kenton:
 
   Hashing the content of the descriptor would actually be incorrect,
   because
   two descriptors with exactly the same content are still considered
   different
   types.  Descriptors are compared by identity, hence they are hashed
   by
   pointer.
 
   Removing the descriptor from the calculation would indeed make
   hashCode()
   consistent between two runs of the same binary, and probably
   insignificant
   runtime cost.  Of course, once you do that, you will never be able to
   introduce non-determinism again because people will depend on it.
 
   But there's a much bigger risk.  People may actually start depending
   on
   hashCode() returning consistent results between two different
   versions of
   the binary, or two completely separate binaries that compile in the
   same
   protocol, or -- most dangerously -- two different versions of the
   same
   protocol (e.g. with fields added or removed).  I think it would be
   very
   difficult and limiting to make these guarantees, so I would be
   extremely
   cautious about this.
 
   Certainly, there is no implementation of hashCode() that would be any
   safer
   than .toByteString().hashCode().  So, I'd advise steering people to
   the
   latter.  Note that if unknown fields are present, the results may
   still be
   inconsistent.  However, there is no reasonable way to implement a
   hashCode()
   that is consistent in the presence of unknown

Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-18 Thread Jason Hsueh
On Wed, May 18, 2011 at 12:44 PM, Jay Booth jaybo...@gmail.com wrote:

 That definitely seems acceptable.  Any reason not to make the default
 implementation return toByteString().hashCode() and get the best of
 both worlds?  It's a little more expensive probably, but
 Object.hashCode() is already deceptively expensive to begin with.


I'm not well-enough versed in Java to say with authority, but as far as I
can tell, hashCode() is intended for HashMap/HashSet, and assuming that
Object.hashCode() provides some consistent hash is incorrect. Yes, protobuf
could attempt to provide a consistent hash with toByteString().hashCode(),
except we already know that there are caveats to that, so it's not like we
can provide any contractual guarantee. Applications like yours that need a
stable hash could implement a wrapper object that overrides hashCode() to
use this approach, or perhaps simply use a different method altogether to
express the stability requirement.


 I personally switched my particular problem to use Strings as keys,


I think you want to avoid using Strings to store the serializations since
this is binary data, and String will coerce that to UTF.


 since that works for the use case here and is easier for corresponding
 command-line tools, I just generally see a consistent hashcode as a
 good idea if supportable.

 On Wed, May 18, 2011 at 3:33 PM, Jason Hsueh jas...@google.com wrote:
  Well, hashing the serialization as Pherl/Kenton suggested gives you that,
  and seems cleaner and more direct than making hashCode() of the in-memory
  object assume that you are doing something across JVMs.
  (Also, unknown fields are pretty common when you may have a jungle of
  binaries, which is true here.)
 
  On Wed, May 18, 2011 at 12:19 PM, Jay Booth jaybo...@gmail.com wrote:
 
  Right, I forgot that the Descriptor object is statically initialized
  so it'll be consistent within the JVM.
 
  I still think that in the context of serializable objects, (i.e.
  objects intended to be transported between JVMs in some way, shape or
  form) a consistent hashCode would be useful for a lot of cases (mine
  included).  If it can't be consistent in the presence of unknown
  fields, perhaps a well-documented caution to that (relatively
  uncommon?) case would be more useful than completely punting on all
  cases.
 
  On May 18, 3:08 pm, Ron Reynolds tequila...@ymail.com wrote:
   the corner-stone of Hash* containers is:
(A.equals(B)) = (A.hashCode() == B.hashCode()) for all A, B.
  
   tho it's not explicitly stated it would seem to be implied that is
   within a
   single JVM.
   not sure if the code in question maintains that rule within a JVM (if
   not that's
   a big deal).
   if so that would seem sufficient for all but the most distributed of
   Hash*
   containers, such as where a client which is remote from the storage
   (i.e, in
   another JVM) determines the bucket (based on hashCode()) to find that
   the
   element in question has been placed into another bucket because the
   hashCode()
   within the containing JVM has evaluated to another value.  that's a
   pretty
   far-fetched, but not unimaginable, situation.
  
   
   From: Jason Hsueh jas...@google.com
   To: Jay Booth jaybo...@gmail.com
   Cc: Protocol Buffers protobuf@googlegroups.com
   Sent: Wed, May 18, 2011 12:00:19 PM
   Subject: Re: [protobuf] Re: Generated hashcode() returns different
   values across
   JVM instances?
  
   Jumping in late to the thread, and I'm not really a Java person, so I
   may be
   misunderstanding something here. But as far as I can tell, you are
   asking for
   hashCode() to be a 'consistent' hash across processes. hashCode() as
   implemented
   is still useful within a single JVM, allowing you to use protobufs in
   HashMaps
   based on content rather than object identity. That was the intended
 use
   case.
  
   On Wed, May 18, 2011 at 11:48 AM, Jay Booth jaybo...@gmail.com
 wrote:
  
   Well, that's your prerogative, I guess, but why even implement
  
  
  
  
  
  
  
  
  
   hashcode at all then?  Just inherit from object and you're getting
   effectively the same behavior.  Is that what you're intending?
  
   On May 16, 10:03 am, Pherl Liu liuj...@google.com wrote:
We discussed internally and decided not to make the hashCode()
return deterministic result. If you need consistent hashcode in
different
runs, use toByteString().hashCode().
  
Quoted from Kenton:
  
Hashing the content of the descriptor would actually be incorrect,
because
two descriptors with exactly the same content are still considered
different
types.  Descriptors are compared by identity, hence they are hashed
by
pointer.
  
Removing the descriptor from the calculation would indeed make
hashCode()
consistent between two runs of the same binary, and probably
insignificant
runtime cost.  Of course, once you do that, you will never

Re: [protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-16 Thread Pherl Liu
We discussed internally and decided not to make the hashCode()
return deterministic result. If you need consistent hashcode in different
runs, use toByteString().hashCode().

Quoted from Kenton:

Hashing the content of the descriptor would actually be incorrect, because
two descriptors with exactly the same content are still considered different
types.  Descriptors are compared by identity, hence they are hashed by
pointer.

Removing the descriptor from the calculation would indeed make hashCode()
consistent between two runs of the same binary, and probably insignificant
runtime cost.  Of course, once you do that, you will never be able to
introduce non-determinism again because people will depend on it.

But there's a much bigger risk.  People may actually start depending on
hashCode() returning consistent results between two different versions of
the binary, or two completely separate binaries that compile in the same
protocol, or -- most dangerously -- two different versions of the same
protocol (e.g. with fields added or removed).  I think it would be very
difficult and limiting to make these guarantees, so I would be extremely
cautious about this.

Certainly, there is no implementation of hashCode() that would be any safer
than .toByteString().hashCode().  So, I'd advise steering people to the
latter.  Note that if unknown fields are present, the results may still be
inconsistent.  However, there is no reasonable way to implement a hashCode()
that is consistent in the presence of unknown fields.

On Thu, May 12, 2011 at 5:32 AM, Ben Wright compuware...@gmail.com wrote:

 I think we wrote those replies at the same time : )

 You're right, at the cost of some additional hash collisions, the
 simplest solution is to simply not include the type / descriptor in
 the hash calculation at all.

 The best / least-collision solutions with good performance would be
 what I wrote in my previous post, but that requires that someone
 (presumably a current committer) with sufficient knowledge of the
 Descriptor types to have enough time to update the compiler and java
 libraries accordingly.

 Any input from a committer for this issue?  Seems the simple solution
 would take less than an hour to push into the stream and could make it
 into the next release.

 On May 11, 5:25 pm, Ben Wright compuware...@gmail.com wrote:
  Alternatively... instead of putting the onus on the compiler, the
  hashcode could be computed by the JVM at initialization time for the
  Descriptor instance, (which would also help performance of dynamically
  parsed Descriptor instance hashcode calls).
 
  i.e.
 
  private final int computedHashcode;
 
  public Descriptor() {
 //initialization
 
computedHashcode = do_compute_hashCode();
 
  }
 
  public int hashCode() {
  return computedHashcode;
 
  }
 
  punlic int do_compute_hashCode(){
return // compute hashcode
 
  }
 
  This is all talking towards optimum performance implementation... the
  real problem is the need for a hashCode implementation for Descriptor
  based on the actual Descriptor's content...
 
  On May 11, 4:54 pm, Ben Wright compuware...@gmail.com wrote:
 
 
 
 
 
 
 
   Jay:
 
   Using the class name to generate the hashcode is logically incorrect
   because the class name can be derived by the options java_package_
   name and java_outer_classname.
 
   Additionally (although less likely to matter), separate protocol
   buffer files can define an identical class names with different
   protocol buffers.
 
   Lastly, and most importantly...
 
   If the same Message is being used with generated code and with dynamic
   code, the hash code for the descriptor would still be identical if
   generated from the descriptor instance, whereas the dynamic usage does
   not have a classname from which to derive a hashcode.  While in your
   case this should not matter, it does matter for other users of
   protobuf.  The hashcode function would be better served by being
   implemented correctly from state data for the descriptor.
   Additionally, in generated code it seems that this hashcode could be
   pre-computed by the compiler and Descriptor.hashcode() could return a
   constant integer - which would be much more efficient than any other
   method.
 
   On May 11, 3:02 pm, Jay Booth jaybo...@gmail.com wrote:
 
It can be legitimate, especially in the case of Object.hashCode(),
 but
it's supposed to be in sync with equals() by contract.  As it stands,
two objects which are equal() will produce different hashes, or the
same logical object will produce different hashes across JVMs.  That
breaks the contract..  if the equals() method simply did return
 (other
== this), then it'd be fine, albeit a little useless.
 
I created an issue and posted a 1-liner patch that would eliminate
 the
problem by using getClass().getName().hashCode() to incorporate type
information into the hashCode without depending on a Descriptor
object's memory 

[protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-11 Thread Jay Booth
It can be legitimate, especially in the case of Object.hashCode(), but
it's supposed to be in sync with equals() by contract.  As it stands,
two objects which are equal() will produce different hashes, or the
same logical object will produce different hashes across JVMs.  That
breaks the contract..  if the equals() method simply did return (other
== this), then it'd be fine, albeit a little useless.

I created an issue and posted a 1-liner patch that would eliminate the
problem by using getClass().getName().hashCode() to incorporate type
information into the hashCode without depending on a Descriptor
object's memory address.

On May 11, 12:01 am, Dmitriy Ryaboy dvrya...@gmail.com wrote:
 Hi Jay,

 I encountered that before. Unfortunately this is a legitimate thing to
 do, as documented in Object.hashCode()

 I have a write-up of the problem and how we wound up solving it (not
 elegant.. suggestions welcome) 
 here:http://squarecog.wordpress.com/2011/02/20/hadoop-requires-stable-hash...

 D







 On Mon, May 9, 2011 at 8:25 AM, Jay Booth jaybo...@gmail.com wrote:
  I'm testing an on-disk hashtable with Protobufs and noticed that with
  the java generated hashcode function, it seems to return a different
  hashcode across JVM invocations for the same logically equivalent
  object (tested with a single string protobuf, same string for both
  instances).

  Is this known behavior?  Bit busy right now backporting this to work
  with String keys instead but I could provide a bit of command line
  code that demonstrates the issue when I get a chance.

  Glancing at the generated hashcode() function, it looks like the
  difference comes from etiher getDescriptorForType().hashCode() or
  getUnknownFields().hashCode(), both of which are incorporated.

  --
  You received this message because you are subscribed to the Google Groups 
  Protocol Buffers group.
  To post to this group, send email to protobuf@googlegroups.com.
  To unsubscribe from this group, send email to 
  protobuf+unsubscr...@googlegroups.com.
  For more options, visit this group 
  athttp://groups.google.com/group/protobuf?hl=en.

-- 
You received this message because you are subscribed to the Google Groups 
Protocol Buffers group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.



[protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-11 Thread Ben Wright
Jay:

Using the class name to generate the hashcode is logically incorrect
because the class name can be derived by the options java_package_
name and java_outer_classname.

Additionally (although less likely to matter), separate protocol
buffer files can define an identical class names with different
protocol buffers.

Lastly, and most importantly...

If the same Message is being used with generated code and with dynamic
code, the hash code for the descriptor would still be identical if
generated from the descriptor instance, whereas the dynamic usage does
not have a classname from which to derive a hashcode.  While in your
case this should not matter, it does matter for other users of
protobuf.  The hashcode function would be better served by being
implemented correctly from state data for the descriptor.
Additionally, in generated code it seems that this hashcode could be
pre-computed by the compiler and Descriptor.hashcode() could return a
constant integer - which would be much more efficient than any other
method.


On May 11, 3:02 pm, Jay Booth jaybo...@gmail.com wrote:
 It can be legitimate, especially in the case of Object.hashCode(), but
 it's supposed to be in sync with equals() by contract.  As it stands,
 two objects which are equal() will produce different hashes, or the
 same logical object will produce different hashes across JVMs.  That
 breaks the contract..  if the equals() method simply did return (other
 == this), then it'd be fine, albeit a little useless.

 I created an issue and posted a 1-liner patch that would eliminate the
 problem by using getClass().getName().hashCode() to incorporate type
 information into the hashCode without depending on a Descriptor
 object's memory address.

 On May 11, 12:01 am, Dmitriy Ryaboy dvrya...@gmail.com wrote:







  Hi Jay,

  I encountered that before. Unfortunately this is a legitimate thing to
  do, as documented in Object.hashCode()

  I have a write-up of the problem and how we wound up solving it (not
  elegant.. suggestions welcome) 
  here:http://squarecog.wordpress.com/2011/02/20/hadoop-requires-stable-hash...

  D

  On Mon, May 9, 2011 at 8:25 AM, Jay Booth jaybo...@gmail.com wrote:
   I'm testing an on-disk hashtable with Protobufs and noticed that with
   the java generated hashcode function, it seems to return a different
   hashcode across JVM invocations for the same logically equivalent
   object (tested with a single string protobuf, same string for both
   instances).

   Is this known behavior?  Bit busy right now backporting this to work
   with String keys instead but I could provide a bit of command line
   code that demonstrates the issue when I get a chance.

   Glancing at the generated hashcode() function, it looks like the
   difference comes from etiher getDescriptorForType().hashCode() or
   getUnknownFields().hashCode(), both of which are incorporated.

   --
   You received this message because you are subscribed to the Google Groups 
   Protocol Buffers group.
   To post to this group, send email to protobuf@googlegroups.com.
   To unsubscribe from this group, send email to 
   protobuf+unsubscr...@googlegroups.com.
   For more options, visit this group 
   athttp://groups.google.com/group/protobuf?hl=en.

-- 
You received this message because you are subscribed to the Google Groups 
Protocol Buffers group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.



[protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-11 Thread Jay Booth
Well, to sidestep the whole discussion, we could just eliminate types
from hashcode completely and work strictly on field hashcodes.  The
type check is there in equals(), hashmaps will function correctly, and
hash collisions are already allowed between inequal objects, they're a
fact of life.   Return 1; would be legal, it would just be a really
bad idea for anything that actually uses the hashcode.

I figured getClass().getName() would provide the additional hash
spread benefit in the majority of cases and HashMaps and the like
could fall back on equals() as they always do upon the rare collision
between differently-typed-but-same-fields objects.  Descriptor has a
bunch of stuff attached to it, which would lead to a potentially
complicated/expensive hashCode() implementation and I didn't
understand that code well enough to suggest a patch.  I also don't
know the protobuf compiler well enough to have an opinion on
generating a constant per-class as you suggest, I could see it leading
to trouble when you regenerate classes that haven't changed unless
it's deterministic somehow.  Just going with the KISS principle.



On May 11, 4:54 pm, Ben Wright compuware...@gmail.com wrote:
 Jay:

 Using the class name to generate the hashcode is logically incorrect
 because the class name can be derived by the options java_package_
 name and java_outer_classname.

 Additionally (although less likely to matter), separate protocol
 buffer files can define an identical class names with different
 protocol buffers.

 Lastly, and most importantly...

 If the same Message is being used with generated code and with dynamic
 code, the hash code for the descriptor would still be identical if
 generated from the descriptor instance, whereas the dynamic usage does
 not have a classname from which to derive a hashcode.  While in your
 case this should not matter, it does matter for other users of
 protobuf.  The hashcode function would be better served by being
 implemented correctly from state data for the descriptor.
 Additionally, in generated code it seems that this hashcode could be
 pre-computed by the compiler and Descriptor.hashcode() could return a
 constant integer - which would be much more efficient than any other
 method.

 On May 11, 3:02 pm, Jay Booth jaybo...@gmail.com wrote:







  It can be legitimate, especially in the case of Object.hashCode(), but
  it's supposed to be in sync with equals() by contract.  As it stands,
  two objects which are equal() will produce different hashes, or the
  same logical object will produce different hashes across JVMs.  That
  breaks the contract..  if the equals() method simply did return (other
  == this), then it'd be fine, albeit a little useless.

  I created an issue and posted a 1-liner patch that would eliminate the
  problem by using getClass().getName().hashCode() to incorporate type
  information into the hashCode without depending on a Descriptor
  object's memory address.

  On May 11, 12:01 am, Dmitriy Ryaboy dvrya...@gmail.com wrote:

   Hi Jay,

   I encountered that before. Unfortunately this is a legitimate thing to
   do, as documented in Object.hashCode()

   I have a write-up of the problem and how we wound up solving it (not
   elegant.. suggestions welcome) 
   here:http://squarecog.wordpress.com/2011/02/20/hadoop-requires-stable-hash...

   D

   On Mon, May 9, 2011 at 8:25 AM, Jay Booth jaybo...@gmail.com wrote:
I'm testing an on-disk hashtable with Protobufs and noticed that with
the java generated hashcode function, it seems to return a different
hashcode across JVM invocations for the same logically equivalent
object (tested with a single string protobuf, same string for both
instances).

Is this known behavior?  Bit busy right now backporting this to work
with String keys instead but I could provide a bit of command line
code that demonstrates the issue when I get a chance.

Glancing at the generated hashcode() function, it looks like the
difference comes from etiher getDescriptorForType().hashCode() or
getUnknownFields().hashCode(), both of which are incorporated.

--
You received this message because you are subscribed to the Google 
Groups Protocol Buffers group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group 
athttp://groups.google.com/group/protobuf?hl=en.

-- 
You received this message because you are subscribed to the Google Groups 
Protocol Buffers group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.



[protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-11 Thread Ben Wright
Alternatively... instead of putting the onus on the compiler, the
hashcode could be computed by the JVM at initialization time for the
Descriptor instance, (which would also help performance of dynamically
parsed Descriptor instance hashcode calls).

i.e.

private final int computedHashcode;

public Descriptor() {
   //initialization

  computedHashcode = do_compute_hashCode();
}

public int hashCode() {
return computedHashcode;
}

punlic int do_compute_hashCode(){
  return // compute hashcode
}

This is all talking towards optimum performance implementation... the
real problem is the need for a hashCode implementation for Descriptor
based on the actual Descriptor's content...


On May 11, 4:54 pm, Ben Wright compuware...@gmail.com wrote:
 Jay:

 Using the class name to generate the hashcode is logically incorrect
 because the class name can be derived by the options java_package_
 name and java_outer_classname.

 Additionally (although less likely to matter), separate protocol
 buffer files can define an identical class names with different
 protocol buffers.

 Lastly, and most importantly...

 If the same Message is being used with generated code and with dynamic
 code, the hash code for the descriptor would still be identical if
 generated from the descriptor instance, whereas the dynamic usage does
 not have a classname from which to derive a hashcode.  While in your
 case this should not matter, it does matter for other users of
 protobuf.  The hashcode function would be better served by being
 implemented correctly from state data for the descriptor.
 Additionally, in generated code it seems that this hashcode could be
 pre-computed by the compiler and Descriptor.hashcode() could return a
 constant integer - which would be much more efficient than any other
 method.

 On May 11, 3:02 pm, Jay Booth jaybo...@gmail.com wrote:







  It can be legitimate, especially in the case of Object.hashCode(), but
  it's supposed to be in sync with equals() by contract.  As it stands,
  two objects which are equal() will produce different hashes, or the
  same logical object will produce different hashes across JVMs.  That
  breaks the contract..  if the equals() method simply did return (other
  == this), then it'd be fine, albeit a little useless.

  I created an issue and posted a 1-liner patch that would eliminate the
  problem by using getClass().getName().hashCode() to incorporate type
  information into the hashCode without depending on a Descriptor
  object's memory address.

  On May 11, 12:01 am, Dmitriy Ryaboy dvrya...@gmail.com wrote:

   Hi Jay,

   I encountered that before. Unfortunately this is a legitimate thing to
   do, as documented in Object.hashCode()

   I have a write-up of the problem and how we wound up solving it (not
   elegant.. suggestions welcome) 
   here:http://squarecog.wordpress.com/2011/02/20/hadoop-requires-stable-hash...

   D

   On Mon, May 9, 2011 at 8:25 AM, Jay Booth jaybo...@gmail.com wrote:
I'm testing an on-disk hashtable with Protobufs and noticed that with
the java generated hashcode function, it seems to return a different
hashcode across JVM invocations for the same logically equivalent
object (tested with a single string protobuf, same string for both
instances).

Is this known behavior?  Bit busy right now backporting this to work
with String keys instead but I could provide a bit of command line
code that demonstrates the issue when I get a chance.

Glancing at the generated hashcode() function, it looks like the
difference comes from etiher getDescriptorForType().hashCode() or
getUnknownFields().hashCode(), both of which are incorporated.

--
You received this message because you are subscribed to the Google 
Groups Protocol Buffers group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group 
athttp://groups.google.com/group/protobuf?hl=en.

-- 
You received this message because you are subscribed to the Google Groups 
Protocol Buffers group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.



[protobuf] Re: Generated hashcode() returns different values across JVM instances?

2011-05-11 Thread Ben Wright
I think we wrote those replies at the same time : )

You're right, at the cost of some additional hash collisions, the
simplest solution is to simply not include the type / descriptor in
the hash calculation at all.

The best / least-collision solutions with good performance would be
what I wrote in my previous post, but that requires that someone
(presumably a current committer) with sufficient knowledge of the
Descriptor types to have enough time to update the compiler and java
libraries accordingly.

Any input from a committer for this issue?  Seems the simple solution
would take less than an hour to push into the stream and could make it
into the next release.

On May 11, 5:25 pm, Ben Wright compuware...@gmail.com wrote:
 Alternatively... instead of putting the onus on the compiler, the
 hashcode could be computed by the JVM at initialization time for the
 Descriptor instance, (which would also help performance of dynamically
 parsed Descriptor instance hashcode calls).

 i.e.

 private final int computedHashcode;

 public Descriptor() {
    //initialization

   computedHashcode = do_compute_hashCode();

 }

 public int hashCode() {
     return computedHashcode;

 }

 punlic int do_compute_hashCode(){
   return // compute hashcode

 }

 This is all talking towards optimum performance implementation... the
 real problem is the need for a hashCode implementation for Descriptor
 based on the actual Descriptor's content...

 On May 11, 4:54 pm, Ben Wright compuware...@gmail.com wrote:







  Jay:

  Using the class name to generate the hashcode is logically incorrect
  because the class name can be derived by the options java_package_
  name and java_outer_classname.

  Additionally (although less likely to matter), separate protocol
  buffer files can define an identical class names with different
  protocol buffers.

  Lastly, and most importantly...

  If the same Message is being used with generated code and with dynamic
  code, the hash code for the descriptor would still be identical if
  generated from the descriptor instance, whereas the dynamic usage does
  not have a classname from which to derive a hashcode.  While in your
  case this should not matter, it does matter for other users of
  protobuf.  The hashcode function would be better served by being
  implemented correctly from state data for the descriptor.
  Additionally, in generated code it seems that this hashcode could be
  pre-computed by the compiler and Descriptor.hashcode() could return a
  constant integer - which would be much more efficient than any other
  method.

  On May 11, 3:02 pm, Jay Booth jaybo...@gmail.com wrote:

   It can be legitimate, especially in the case of Object.hashCode(), but
   it's supposed to be in sync with equals() by contract.  As it stands,
   two objects which are equal() will produce different hashes, or the
   same logical object will produce different hashes across JVMs.  That
   breaks the contract..  if the equals() method simply did return (other
   == this), then it'd be fine, albeit a little useless.

   I created an issue and posted a 1-liner patch that would eliminate the
   problem by using getClass().getName().hashCode() to incorporate type
   information into the hashCode without depending on a Descriptor
   object's memory address.

   On May 11, 12:01 am, Dmitriy Ryaboy dvrya...@gmail.com wrote:

Hi Jay,

I encountered that before. Unfortunately this is a legitimate thing to
do, as documented in Object.hashCode()

I have a write-up of the problem and how we wound up solving it (not
elegant.. suggestions welcome) 
here:http://squarecog.wordpress.com/2011/02/20/hadoop-requires-stable-hash...

D

On Mon, May 9, 2011 at 8:25 AM, Jay Booth jaybo...@gmail.com wrote:
 I'm testing an on-disk hashtable with Protobufs and noticed that with
 the java generated hashcode function, it seems to return a different
 hashcode across JVM invocations for the same logically equivalent
 object (tested with a single string protobuf, same string for both
 instances).

 Is this known behavior?  Bit busy right now backporting this to work
 with String keys instead but I could provide a bit of command line
 code that demonstrates the issue when I get a chance.

 Glancing at the generated hashcode() function, it looks like the
 difference comes from etiher getDescriptorForType().hashCode() or
 getUnknownFields().hashCode(), both of which are incorporated.

 --
 You received this message because you are subscribed to the Google 
 Groups Protocol Buffers group.
 To post to this group, send email to protobuf@googlegroups.com.
 To unsubscribe from this group, send email to 
 protobuf+unsubscr...@googlegroups.com.
 For more options, visit this group 
 athttp://groups.google.com/group/protobuf?hl=en.

-- 
You received this message because you are subscribed to the Google Groups 
Protocol Buffers group.
To post