[ https://issues.apache.org/jira/browse/AVRO-388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828219#action_12828219 ]
Doug Cutting commented on AVRO-388: ----------------------------------- It's worth noting that, once this is committed, all Java DatumReader implementations will be parser-based. This is a bit frightening, since I think few beside Thiru understand fully how the this works. It does have documentation. http://hadoop.apache.org/avro/docs/current/api/java/org/apache/avro/io/parsing/doc-files/parsing.html But I have never made it through a detailed read of that document and only claim a partial understanding. The parser-based implementation has real advantages: it's faster and it simplifies things like validation, json encoders/decoders, etc. So, on the whole, I think this is a good change, but it does scare me a bit, since, without Thiru, we'd have a hard time maintaining it. > Using ResolvingDecoder in GenericDatumReader > -------------------------------------------- > > Key: AVRO-388 > URL: https://issues.apache.org/jira/browse/AVRO-388 > Project: Avro > Issue Type: Improvement > Components: java > Reporter: Thiruvalluvan M. G. > Assignee: Thiruvalluvan M. G. > Attachments: AVRO-388-test.patch, AVRO-388.patch > > > This patch removes GenericDatumReader's own logic for schema resolution by > relying on ResolvingDecoder. The code has become somewhat compact. The > duplication logic we were concerned about in AVRO-383 is gone. > There are three changes to the public interface of GenericDatumReader: > * readInt() and readString() functions no longer provide the writer's > schema. I think it is reasonable that code outside GenericDatumReader should > not worry about schema resolution and thus don't have to know about writer's > schema. > * setSchema() function of DatumReader may throw IOException. > None in Avro codebase is affected by these changes. > In terms of performance: > * When there is no resolution required, that is reader's and writer's > schemas are identical, there is no change in performance. In fact, > ResolvingEncoder is not used in this case. Use Perf -G to measuer. > * If the resolution involves only promotions (from int to long) the new > code is about 5% faster. User Perf -Gp > * If the resolution involves supplying default values from the reader's > schema because writer's schema does have a field, the node is faster about > 15%. This improvement is probably due to AVRO-383. User Perf -Gd to measure > this. > * If the resolution involves field reordering (reader's and writer's > schema have the same fields, but in different order) the new code is _slower_ > by about 20%. This is because, ResolvingDecoder provides information on field > reordering through a vector of Schema.Field objects. The Field class has the > field position but not the field name. But GenericDatumWriter's setField() > function needs the field name. getFieldName() function introduced in this > patch does a linear search to find the name from the reader's schema. This is > inefficient. We can handle this problem in any of the following ways: > 1. Do nothing claiming that field reordering is rare and we can live > with this inefficiency. I tried this and saw that the new code performs about > 5-10% faster than the old. > 2. Change the signature of the setField() so that it does not require > the field name. This is an incomaptible change to the GenericDatumReader's > public interface. None of the code in the current Avro will need field name. > But if somebody has extended GenericDatumReader, they will be affected. > 3. Have a new class FieldInfo which has the name and position of the > field and make ResolvingDecoder return an array of FieldInfo instead of > Field. It's not obvious, which package/class this new class would go to. If > one wants a clean dependencies, probably it belongs to io package. I'm not > confortable adding another class into the io package at the present. Keeping > it as an inner class of ResolvingDecoder will create circular dependencies > between the io.parsing package and ResolvingDecoder class. > 4. Instead of a new class FieldInfo above use Map.Entry<String, Field>. > 5. Add a new field called "name" to Schema.Field class. It will be > somewhat redundant because the key of the fields map of RecordSchema is the > field name. With this, the field name will be available both as the key and > as a part of value of the map. > My personal preference is 5 followed by 4. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.