[jira] Commented: (AVRO-388) Using ResolvingDecoder in GenericDatumReader

Doug Cutting (JIRA) Mon, 01 Feb 2010 10:52:44 -0800

    [ 
https://issues.apache.org/jira/browse/AVRO-388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828219#action_12828219
 ]


Doug Cutting commented on AVRO-388:
-----------------------------------

It's worth noting that, once this is committed, all Java DatumReader 
implementations will be parser-based.  This is a bit frightening, since I think 
few beside Thiru understand fully how the this works.  It does have 
documentation.

http://hadoop.apache.org/avro/docs/current/api/java/org/apache/avro/io/parsing/doc-files/parsing.html

But I have never made it through a detailed read of that document and only 
claim a partial understanding.

The parser-based implementation has real advantages: it's faster and it 
simplifies things like validation, json encoders/decoders, etc.  So, on the 
whole, I think this is a good change, but it does scare me a bit, since, 
without Thiru, we'd have a hard time maintaining it.


> Using ResolvingDecoder in GenericDatumReader
> --------------------------------------------
>
>                 Key: AVRO-388
>                 URL: https://issues.apache.org/jira/browse/AVRO-388
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>            Reporter: Thiruvalluvan M. G.
>            Assignee: Thiruvalluvan M. G.
>         Attachments: AVRO-388-test.patch, AVRO-388.patch
>
>
> This patch removes GenericDatumReader's own logic for schema resolution by 
> relying on ResolvingDecoder. The code has become somewhat compact. The 
> duplication logic we were concerned about in AVRO-383 is gone.
> There are three changes to the public interface of GenericDatumReader:
>    * readInt() and readString() functions no longer provide the writer's 
> schema. I think it is reasonable that code outside GenericDatumReader should 
> not worry about schema resolution and thus don't have to know about writer's 
> schema.
>    * setSchema() function of DatumReader may throw IOException.
> None in Avro codebase is affected by these changes.
> In terms of performance:
>    * When there is no resolution required, that is reader's and writer's 
> schemas are identical, there is no change in performance. In fact, 
> ResolvingEncoder is not used in this case. Use Perf -G to measuer.
>    * If the resolution involves only promotions (from int to long)  the new 
> code is about 5% faster. User Perf -Gp
>    * If the resolution involves supplying default values from the reader's 
> schema because writer's schema does have a field, the node is faster about 
> 15%. This improvement is probably due to AVRO-383. User Perf -Gd to measure 
> this.
>    * If the resolution involves field reordering (reader's and writer's 
> schema have the same fields, but in different order) the new code is _slower_ 
> by about 20%. This is because, ResolvingDecoder provides information on field 
> reordering through a vector of Schema.Field objects. The Field class has the 
> field position but not the field name. But GenericDatumWriter's setField() 
> function needs the field name. getFieldName() function introduced in this 
> patch does a linear search to find the name from the reader's schema. This is 
> inefficient. We can handle this problem in any of the following ways:
>       1. Do nothing claiming that field reordering is rare and we can live 
> with this inefficiency. I tried this and saw that the new code performs about 
> 5-10% faster than the old.
>       2. Change the signature of the setField() so that it does not require 
> the field name. This is an incomaptible change to the GenericDatumReader's 
> public interface. None of the code in the current Avro will need field name. 
> But if somebody has extended GenericDatumReader, they will be affected.
>       3. Have a new class FieldInfo which has the name and position of the 
> field and make ResolvingDecoder return an array of FieldInfo instead of 
> Field. It's not obvious, which package/class this new class would go to. If 
> one wants a clean dependencies, probably it belongs to io package. I'm not 
> confortable adding another class into the io package at the present. Keeping 
> it as an inner class of ResolvingDecoder will create circular dependencies 
> between the io.parsing package and ResolvingDecoder class.
>      4. Instead of a new class FieldInfo above use Map.Entry<String, Field>.
>      5. Add a new field called "name" to Schema.Field class. It will be 
> somewhat redundant because the key of the fields map of RecordSchema is the 
> field name. With this, the field name will be available both as the key and 
> as a part of value of the map.
>      My personal preference is 5 followed by 4.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (AVRO-388) Using ResolvingDecoder in GenericDatumReader

Reply via email to