Re: Review Request 12480: HIVE-4732 Reduce or eliminate the expensive Schema equals() check for AvroSerde

Jakob Homan Sun, 25 Aug 2013 22:36:17 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/#review25537
-----------------------------------------------------------



One issue in the testing and a few formatting issues.  Otherwise looks good.


serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java
<https://reviews.apache.org/r/12480/#comment49986>

    Weird spacing... 2x below as well.



serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java
<https://reviews.apache.org/r/12480/#comment49984>

    These should never be null, not even in testing.  It's better to change the
    tests to correctly populate the data structure.



serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java
<https://reviews.apache.org/r/12480/#comment49985>

    And this would indicate a bug.


- Jakob Homan


On Aug. 6, 2013, 7:13 p.m., Mohammad Islam wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/12480/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2013, 7:13 p.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan and Jakob Homan.
> 
> 
> Bugs: HIVE-4732
>     https://issues.apache.org/jira/browse/HIVE-4732
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> From our performance analysis, we found that AvroSerde's schema.equals() call 
> consumed a substantial share (nearly 40%) of the time. This patch aims to 
> minimize the number of schema.equals() calls by performing the check as late 
> and as rarely as possible.
> 
> First, we added a unique ID for each record reader, which is then included 
> in every AvroGenericRecordWritable. Then we introduced two new data 
> structures (one HashSet and one HashMap) to store intermediate data and avoid 
> duplicate checks. The HashSet contains the IDs of all record readers that 
> don't need any re-encoding. The HashMap, on the other hand, contains the 
> re-encoders already in use. It works as a cache and allows re-encoders to be 
> reused. With this change, our test shows a nearly 40% reduction in Avro 
> record reading time.
>  
>    
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java ed2a9af 
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java e994411 
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestAvroDeserializer.java 3828940 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b 
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb 
> 
> Diff: https://reviews.apache.org/r/12480/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Mohammad Islam
> 
>
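
For readers following the description quoted above, here is a minimal sketch of the caching idea it outlines: a set of reader IDs that need no re-encoding, plus a map of re-encoders that can be reused. It is not the code from the patch. The names ReaderSchemaCache, FakeSchema, ReEncoder and lookup are hypothetical, java.util.UUID stands in for whatever ID type the patch uses, and FakeSchema is a stand-in for an Avro schema that only counts how often equals() runs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Illustrative sketch only; all names here are assumptions, not Hive APIs. */
class ReaderSchemaCache {

    /** Hypothetical stand-in for an Avro schema; counts equals() invocations. */
    static final class FakeSchema {
        static int equalsCalls = 0;
        final String json;

        FakeSchema(String json) {
            this.json = json;
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++; // models the expensive comparison
            return o instanceof FakeSchema && ((FakeSchema) o).json.equals(json);
        }

        @Override
        public int hashCode() {
            return json.hashCode();
        }
    }

    /** Hypothetical stand-in for an object that re-encodes records between schemas. */
    static final class ReEncoder {
        final FakeSchema from;
        final FakeSchema to;

        ReEncoder(FakeSchema from, FakeSchema to) {
            this.from = from;
            this.to = to;
        }
    }

    // IDs of record readers whose schema already matches the reader schema.
    private final Set<UUID> noReEncodingNeeded = new HashSet<>();
    // Re-encoders already built, keyed by record reader ID.
    private final Map<UUID, ReEncoder> reEncoders = new HashMap<>();

    /**
     * Returns null when records from this reader need no re-encoding,
     * otherwise a (possibly cached) re-encoder. The schema comparison
     * runs at most once per reader ID.
     */
    ReEncoder lookup(UUID readerId, FakeSchema fileSchema, FakeSchema readerSchema) {
        if (noReEncodingNeeded.contains(readerId)) {
            return null;
        }
        ReEncoder cached = reEncoders.get(readerId);
        if (cached != null) {
            return cached;
        }
        if (fileSchema.equals(readerSchema)) {
            noReEncodingNeeded.add(readerId);
            return null;
        }
        ReEncoder created = new ReEncoder(fileSchema, readerSchema);
        reEncoders.put(readerId, created);
        return created;
    }
}
```

In this sketch, calling lookup() once per record with the same reader ID triggers a single equals() call for that reader, however many records it produces. The cheap set and map lookups, which hash the small ID rather than the schema, handle every later record.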
