[ https://issues.apache.org/jira/browse/AVRO-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872807#comment-13872807 ]
Doug Cutting commented on AVRO-1428: ------------------------------------ The current design is: - hashCode() re-computes if it's out of date, otherwise returns the cached value - computeHash() always re-computes. - subclasses override computeHash() That seems consistent and clear to me. If there are schema instances shared in multiple schemas then they may have their hashcode recomputed multiple times, once per schema that includes them, but that's not a common case that I've seen. The most common case is that schemas are parsed from JSON and do not share structure with other schemas. Moving the cache check into computeHash() on the base class (as you propose) means that only primitive classes have a cache: hashCode() on non-primitive schema subclasses will call computeHash() which will always perform the full calculation, since their base class no longer consults the cache. Since your benchmark only measures hashCode on fields of a primitive type, this looks like an improvement, but in general I think it makes things worse. If we believed that improving the hashCode performance on schemas with shared structures would significantly help some applications then we might re-design this. The complicated case to get right is circular schemas, especially circles involving multiple records, for example A contains a B that contains an A. When you invalidate the cache for one element of the circle then you need to invalidate the cache for other elements of the circle, since they're interdependent. With the current implementation this is not required, since if a cache is invalidated anywhere then no cache is used when things are recomputed. A good example of complex circles is testSchemaExplosion. Making this case work while still caching hashcodes for record schemas is hard. > Schema.computeHash() to add if check to avoid unnecessary hashcode computation > ------------------------------------------------------------------------------ > > Key: AVRO-1428 > URL: https://issues.apache.org/jira/browse/AVRO-1428 > Project: Avro > Issue Type: Improvement > Components: java > Reporter: Tie Liu > Attachments: AVRO-1428.patch, AVRO-1428.patch-V2, TestAvro.java > > > In current Schma.java we have following implementation: > public final int hashCode() { > if (hashCode == NO_HASHCODE) > hashCode = computeHash(); > return hashCode; > } > int computeHash() { return getType().hashCode() + props.hashCode(); } > While hashCode is doing the checking of "if (hashCode == NO_HASHCODE)", the > computeHash method is not. But the computeHash method is being called from > Schema$Field.hashCode and the subclasses hashCode implementation like > following: > public int hashCode() { return name.hashCode() + schema.computeHash(); } > //this is from Schema$Field class > This is causing the the calculation of hashCode getting called > unnecessarily extensively. The proposed changed is to add the "if" check > inside the computeHash method instead: > int computeHash() > { > if (hashCode == NO_HASHCODE) > { > hashCode = getType().hashCode() + props.hashCode(); > } > return hashCode; > } > We did a simple test to compare the performance difference, below is a > summary of the heap snapshot of comparing the difference: > As a test I wrote a small program that creates a HashMap<Schema.Field, > Integer>() and enters a loop simply identifying whether various Schema.Field > instances are keys in the map. Obviously this is a pathological test case, > but when running with the current implementation of Schema.Field it has (in > about 30 seconds) used up nearly 8 GBytes of heap in instantiating > intermediate objects associated with calling Schema.computeHash(): > Heap > PSYoungGen total 17432576K, used 8666481K [0x0000000340000000, > 0x0000000800000000, 0x0000000800000000) > eden space 14942208K, 58% used > [0x0000000340000000,0x0000000550f5c650,0x00000006d0000000) > from space 2490368K, 0% used > [0x0000000768000000,0x0000000768000000,0x0000000800000000) > to space 2490368K, 0% used > [0x00000006d0000000,0x00000006d0000000,0x0000000768000000) > ParOldGen total 1048576K, used 0K [0x0000000300000000, > 0x0000000340000000, 0x0000000340000000) > object space 1048576K, 0% used > [0x0000000300000000,0x0000000300000000,0x0000000340000000) > PSPermGen total 21504K, used 5782K [0x00000002fae00000, > 0x00000002fc300000, 0x0000000300000000) > object space 21504K, 26% used > [0x00000002fae00000,0x00000002fb3a5818,0x00000002fc300000) > When running with the modified implementation (and no other change) all the > object allocation vanishes: > Heap > PSYoungGen total 17432576K, used 896532K [0x0000000340000000, > 0x0000000800000000, 0x0000000800000000) > eden space 14942208K, 6% used > [0x0000000340000000,0x0000000376b852d0,0x00000006d0000000) > from space 2490368K, 0% used > [0x0000000768000000,0x0000000768000000,0x0000000800000000) > to space 2490368K, 0% used > [0x00000006d0000000,0x00000006d0000000,0x0000000768000000) > ParOldGen total 1048576K, used 0K [0x0000000300000000, > 0x0000000340000000, 0x0000000340000000) > object space 1048576K, 0% used > [0x0000000300000000,0x0000000300000000,0x0000000340000000) > PSPermGen total 21504K, used 5768K [0x00000002fae00000, > 0x00000002fc300000, 0x0000000300000000) > object space 21504K, 26% used > [0x00000002fae00000,0x00000002fb3a2240,0x00000002fc300000) > As a side-effect the test runs x3 faster with the modified hashCode() > implementation. -- This message was sent by Atlassian JIRA (v6.1.5#6160)