Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

Jakob Homan Fri, 12 Jul 2013 15:45:46 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/#review23113
-----------------------------------------------------------



Do you have after-optimization performance numbers?  Can you add a test to 
verify that the reencoder cache is working correctly?  Feed in a record with 
one UUID, then another with a different UUID, and verify that the cache has 
two elements.  Adding a third record with the original UUID shouldn't increase 
the size of the cache.  Also verify that adding n records, all with the same 
schema, creates only one reencoder.
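
A rough sketch of the behaviour such a test would pin down, assuming the 
cache is keyed by the record reader's UUID.  ReencoderCacheSketch, 
reencoderFor, cacheSize and createdCount are made-up names for illustration, 
not the patch's actual classes or methods, and the type parameter R stands in 
for the real reencoder type:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical stand-in for the two caches in the patch description.
// Not the actual AvroDeserializer code; R stands in for the reencoder type.
final class ReencoderCacheSketch<R> {
    // IDs of record readers whose records need no re-encoding.
    private final Set<UUID> noReencodingNeeded = new HashSet<>();
    // Reencoders already built, keyed by record reader ID.
    private final Map<UUID, R> reencoders = new HashMap<>();
    private int created = 0;

    // Returns null when records from this reader need no re-encoding,
    // otherwise the (possibly cached) reencoder for that reader.
    // Assumes the factory never returns null.
    R reencoderFor(UUID readerId, BooleanSupplier schemasEqual, Supplier<R> factory) {
        if (noReencodingNeeded.contains(readerId)) {
            return null;
        }
        R cached = reencoders.get(readerId);
        if (cached != null) {
            return cached;
        }
        // The expensive comparison runs at most once per reader ID.
        if (schemasEqual.getAsBoolean()) {
            noReencodingNeeded.add(readerId);
            return null;
        }
        R reencoder = factory.get();
        created++;
        reencoders.put(readerId, reencoder);
        return reencoder;
    }

    int cacheSize() {
        return reencoders.size();
    }

    int createdCount() {
        return created;
    }
}
```

A test along these lines would feed two distinct UUIDs and expect 
cacheSize() == 2, feed the first UUID again and expect the size to stay at 2, 
and feed n records with one UUID and expect createdCount() == 1.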


serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java
<https://reviews.apache.org/r/12480/#comment46953>

    verifiedRecordReaders -> noReencodingNeeded ?



serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java
<https://reviews.apache.org/r/12480/#comment46956>

    readability: pull out getRecordReaderID into its own var



serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
<https://reviews.apache.org/r/12480/#comment46958>

    Need to write out the uuid too



serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
<https://reviews.apache.org/r/12480/#comment46959>

    Need to read in the uuid too
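
    A minimal sketch of what the two comments above ask for, assuming the
    ID is a java.util.UUID serialized as two longs.  ReaderIdWritableSketch
    and roundTrip are hypothetical names; the real class would also
    serialize the record itself, which is omitted here:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.UUID;

// Hypothetical illustration only; not the actual AvroGenericRecordWritable.
final class ReaderIdWritableSketch {
    private UUID recordReaderID;

    ReaderIdWritableSketch() {
    }

    ReaderIdWritableSketch(UUID recordReaderID) {
        this.recordReaderID = recordReaderID;
    }

    UUID getRecordReaderID() {
        return recordReaderID;
    }

    // Mirrors the shape of Writable.write: the ID goes out as two longs.
    void write(DataOutput out) throws IOException {
        out.writeLong(recordReaderID.getMostSignificantBits());
        out.writeLong(recordReaderID.getLeastSignificantBits());
    }

    // Mirrors the shape of Writable.readFields: read back in the same order.
    void readFields(DataInput in) throws IOException {
        long mostSigBits = in.readLong();
        long leastSigBits = in.readLong();
        recordReaderID = new UUID(mostSigBits, leastSigBits);
    }

    // Serializes and deserializes an ID in memory, for use in a unit test.
    static UUID roundTrip(UUID id) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new ReaderIdWritableSketch(id).write(new DataOutputStream(bytes));
            ReaderIdWritableSketch copy = new ReaderIdWritableSketch();
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
            return copy.getRecordReaderID();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    If the ID is not carried through write/readFields, a deserialized copy
    of the writable loses it and the caches keyed on it cannot be used.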


- Jakob Homan


On July 11, 2013, 3:31 p.m., Mohammad Islam wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/12480/
> -----------------------------------------------------------
> 
> (Updated July 11, 2013, 3:31 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Jakob Homan.
> 
> 
> Bugs: HIVE-4732
>     https://issues.apache.org/jira/browse/HIVE-4732
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> From our performance analysis, we found that AvroSerde's schema.equals() 
> call consumed a substantial amount (nearly 40%) of the time. This patch aims 
> to minimize the number of schema.equals() calls by deferring the check as 
> late as possible and performing it as few times as possible.
> 
> First, we added a unique ID for each record reader, which is then included 
> in every AvroGenericRecordWritable. Then we introduced two new data 
> structures (one HashSet and one HashMap) to store intermediate data and 
> avoid duplicate checks. The HashSet contains the IDs of all record readers 
> that don't need any re-encoding. The HashMap contains the re-encoders 
> already in use; it works as a cache and allows re-encoder reuse. With this 
> change, our test shows a nearly 40% reduction in Avro record reading time.
>  
>    
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java dbc999f 
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java c85ef15 
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 
> 
> Diff: https://reviews.apache.org/r/12480/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Mohammad Islam
> 
>
