[ 
https://issues.apache.org/jira/browse/AVRO-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873831#comment-13873831
 ] 

Tie Liu commented on AVRO-1428:
-------------------------------

yes, a given call to RecordSchema#computeHash keeps track of all the hashcodes 
for the schema has been calculated, but not the case for the UnionSchema. In 
the example I gave above, when calculating the hashcode for that union schema, 
it need to calculate the hashcode, it calls:
=>TypeASchema.computeHash =>TypeDSchema.computeHash
but after TypeASchema.computeHash return, it will clear the SEEN_HASHCODE. so 
when it calculate the TypeBSchema.computeHash, it need to compute 
TypeDSchema.computeHash again.

RecordSchema.computeHash:
        @Override
        int computeHash()
        {
            Map seen = SEEN_HASHCODE.get();
            if (seen.containsKey(this))
                return 0; // prevent stack overflow
            boolean first = seen.isEmpty();
            try
            {
                seen.put(this, this);
                return super.computeHash() + fields.hashCode();
            }
            finally
            {
                if (first)
                    seen.clear();
            }
        }

UnionSchema.computeHash:
        @Override
        int computeHash()
        {
            int hash = super.computeHash();
            for (Schema type : types)
                hash += type.computeHash();
            return hash;
        }


> Schema.computeHash() to add if check to avoid unnecessary hashcode computation
> ------------------------------------------------------------------------------
>
>                 Key: AVRO-1428
>                 URL: https://issues.apache.org/jira/browse/AVRO-1428
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>            Reporter: Tie Liu
>         Attachments: AVRO-1428.patch, AVRO-1428.patch-V2, TestAvro.java
>
>
> In current Schma.java we have following implementation:
>   public final int hashCode() {
>     if (hashCode == NO_HASHCODE)
>       hashCode = computeHash();
>     return hashCode;
>   }
>   int computeHash() { return getType().hashCode() + props.hashCode(); }
>   While hashCode is doing the checking of "if (hashCode == NO_HASHCODE)", the 
> computeHash method is not. But the computeHash method is being called from 
> Schema$Field.hashCode and the subclasses hashCode implementation like 
> following:
> public int hashCode() { return name.hashCode() + schema.computeHash(); }  
> //this is from Schema$Field class
>   This is causing the the calculation of hashCode getting called 
> unnecessarily extensively. The proposed changed is to add the "if" check 
> inside the computeHash method instead:
>     int computeHash()
>     {
>         if (hashCode == NO_HASHCODE)
>         {
>             hashCode = getType().hashCode() + props.hashCode();
>         }
>         return hashCode;
> }
> We did a simple test to compare the performance difference, below is a 
> summary of the heap snapshot of comparing the difference:
> As a test I wrote a small program that creates a HashMap<Schema.Field, 
> Integer>() and enters a loop simply identifying whether various Schema.Field 
> instances are keys in the map. Obviously this is a pathological test case, 
> but when running with the current implementation of Schema.Field it has (in 
> about 30 seconds) used up nearly 8 GBytes of heap in instantiating 
> intermediate objects associated with calling Schema.computeHash():
> Heap
> PSYoungGen      total 17432576K, used 8666481K [0x0000000340000000, 
> 0x0000000800000000, 0x0000000800000000)
>   eden space 14942208K, 58% used 
> [0x0000000340000000,0x0000000550f5c650,0x00000006d0000000)
>   from space 2490368K, 0% used 
> [0x0000000768000000,0x0000000768000000,0x0000000800000000)
>   to   space 2490368K, 0% used 
> [0x00000006d0000000,0x00000006d0000000,0x0000000768000000)
> ParOldGen       total 1048576K, used 0K [0x0000000300000000, 
> 0x0000000340000000, 0x0000000340000000)
>   object space 1048576K, 0% used 
> [0x0000000300000000,0x0000000300000000,0x0000000340000000)
> PSPermGen       total 21504K, used 5782K [0x00000002fae00000, 
> 0x00000002fc300000, 0x0000000300000000)
>   object space 21504K, 26% used 
> [0x00000002fae00000,0x00000002fb3a5818,0x00000002fc300000)
> When running with the modified implementation (and no other change) all the 
> object allocation vanishes:
> Heap
> PSYoungGen      total 17432576K, used 896532K [0x0000000340000000, 
> 0x0000000800000000, 0x0000000800000000)
>   eden space 14942208K, 6% used 
> [0x0000000340000000,0x0000000376b852d0,0x00000006d0000000)
>   from space 2490368K, 0% used 
> [0x0000000768000000,0x0000000768000000,0x0000000800000000)
>   to   space 2490368K, 0% used 
> [0x00000006d0000000,0x00000006d0000000,0x0000000768000000)
> ParOldGen       total 1048576K, used 0K [0x0000000300000000, 
> 0x0000000340000000, 0x0000000340000000)
>   object space 1048576K, 0% used 
> [0x0000000300000000,0x0000000300000000,0x0000000340000000)
> PSPermGen       total 21504K, used 5768K [0x00000002fae00000, 
> 0x00000002fc300000, 0x0000000300000000)
>   object space 21504K, 26% used 
> [0x00000002fae00000,0x00000002fb3a2240,0x00000002fc300000)
> As a side-effect the test runs x3 faster with the modified hashCode() 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to