-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/12480/#review23113
-----------------------------------------------------------

Do you have after-optimization performance numbers?

Can you add a test to verify that the reencoder cache is working correctly? Feed in a record with one UUID, then another with a different UUID, and verify that the cache has two elements. Adding a third record with the original UUID shouldn't increase the size of the cache. Also verify that adding n records all with the same schema creates only one reencoder...


serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java
<https://reviews.apache.org/r/12480/#comment46953>

    verifiedRecordReaders -> noReencodingNeeded ?


serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java
<https://reviews.apache.org/r/12480/#comment46956>

    readability: pull out getRecordReaderID into its own var


serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
<https://reviews.apache.org/r/12480/#comment46958>

    Need to write out the uuid too


serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java
<https://reviews.apache.org/r/12480/#comment46959>

    Need to read in the uuid too


- Jakob Homan


On July 11, 2013, 3:31 p.m., Mohammad Islam wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/12480/
> -----------------------------------------------------------
>
> (Updated July 11, 2013, 3:31 p.m.)
>
>
> Review request for hive, Ashutosh Chauhan and Jakob Homan.
>
>
> Bugs: HIVE-4732
>     https://issues.apache.org/jira/browse/HIVE-4732
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> From our performance analysis, we found that AvroSerde's schema.equals() call
> consumed a substantial amount (nearly 40%) of time. This patch intends to
> minimize the number of schema.equals() calls by performing the check as late
> and as rarely as possible.
>
> First, we added a unique ID for each record reader, which is then included
> in every AvroGenericRecordWritable. Then, we introduced two new data
> structures (one HashSet and one HashMap) to store intermediate data and avoid
> duplicate checks. The HashSet contains the IDs of all record readers that
> don't need any re-encoding. The HashMap, on the other hand, contains the
> re-encoders that have already been used. It works as a cache and allows
> re-encoders to be reused. With this change, our test shows a nearly 40%
> reduction in Avro record reading time.
>
>
> Diffs
> -----
>
>   ql/src/java/org/apache/hadoop/hive/ql/io/avro/AvroGenericRecordReader.java dbc999f
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java c85ef15
>   serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroGenericRecordWritable.java 66f0348
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/TestSchemaReEncoder.java 9af751b
>   serde/src/test/org/apache/hadoop/hive/serde2/avro/Utils.java 2b948eb
>
> Diff: https://reviews.apache.org/r/12480/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Mohammad Islam
>
>
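
For reference, here is a minimal sketch of the caching scheme in the description and of the behaviour the requested test would check. It is not the code from the patch: the class ReencoderCacheSketch, its methods, and the use of plain strings in place of Avro schemas are all assumptions made for illustration. Only the idea of a HashSet of reader IDs that need no re-encoding plus a HashMap of re-encoders keyed by reader ID comes from the description above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Hypothetical stand-in for the deserializer's caching logic; schemas are modeled as strings. */
class ReencoderCacheSketch {
    /** Hypothetical placeholder for a re-encoder bound to one writer schema. */
    static final class Reencoder {
        final String writerSchema;
        Reencoder(String writerSchema) { this.writerSchema = writerSchema; }
    }

    private final String readerSchema;
    // IDs of record readers whose schema already matches the reader schema.
    private final Set<UUID> noReencodingNeeded = new HashSet<>();
    // Re-encoders that were already built, keyed by record reader ID.
    private final Map<UUID, Reencoder> reencoderCache = new HashMap<>();
    // Counts the expensive comparisons so a test can assert on them.
    private int schemaComparisons = 0;

    ReencoderCacheSketch(String readerSchema) { this.readerSchema = readerSchema; }

    /** Returns the re-encoder for this record reader, or null if the record can be read as-is. */
    Reencoder reencoderFor(UUID recordReaderId, String writerSchema) {
        if (noReencodingNeeded.contains(recordReaderId)) {
            return null; // verified earlier, no comparison needed
        }
        Reencoder cached = reencoderCache.get(recordReaderId);
        if (cached != null) {
            return cached; // reuse, no comparison needed
        }
        schemaComparisons++; // stands in for the costly schema.equals() call
        if (writerSchema.equals(readerSchema)) {
            noReencodingNeeded.add(recordReaderId);
            return null;
        }
        Reencoder created = new Reencoder(writerSchema);
        reencoderCache.put(recordReaderId, created);
        return created;
    }

    int cacheSize() { return reencoderCache.size(); }

    int schemaComparisons() { return schemaComparisons; }
}
```

With this shape, two records carrying different reader IDs and a non-matching schema leave two entries in the cache, a third record with the first ID leaves the size unchanged, and any number of records from one reader triggers a single schema comparison.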