[ 
https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906867#action_12906867
 ] 

Doug Cutting commented on AVRO-656:
-----------------------------------

> That would be a major change in what the Union is and what you can do with it.

The specification is primarily concerned with (a) schema and protocol syntax and 
(b) the format of the corresponding data.  So, as long as an implementation 
produces and consumes valid schemas and data, it's a conforming implementation.  
A high-fidelity implementation can read and write data without alteration, but 
an implementation that cannot write data exactly as read might still be useful 
and still correctly implement the Avro specification.

> If this means that an implementation can't use a string directly for an enum, 
> but instead uses sentinel objects or a container with a value string and name 
> string, isn't that OK?

Sure, that's okay.  But currently Ruby, PHP and Python don't distinguish bytes, 
enum and fixed at runtime.  This is fine except in the case of a union that 
contains these types.  In that case, an application may end up treating a value 
intended to be one type as a different type.  That may be a problem for some 
applications, and may not be for others.  Hopefully someone will fix these 
implementations, e.g., to wrap such union values.  But I don't think in the 
meantime we need to declare that these implementations are non-conforming or 
change the spec.  Rather we should document the limitation and file bugs to 
improve the implementations.
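
To make the limitation concrete, here is a minimal Python sketch.  It does not 
use the Avro Python library; matches, pick_branch, Fixed and 
pick_branch_wrapped are hypothetical names, and schemas are assumed to be plain 
dicts.  It shows a bare value landing in the wrong union branch, and how a 
wrapper that carries the type name could avoid that.

```python
# Hypothetical sketch: none of these names come from the Avro libraries.

def matches(schema, datum):
    """Return True if a bare runtime value could belong to the schema."""
    kind = schema["type"]
    if kind == "bytes":
        return isinstance(datum, bytes)
    if kind == "fixed":
        return isinstance(datum, bytes) and len(datum) == schema["size"]
    if kind == "string":
        return isinstance(datum, str)
    if kind == "enum":
        return isinstance(datum, str) and datum in schema["symbols"]
    return False


def pick_branch(union, datum):
    """Return the index of the first branch that accepts the bare value."""
    for index, branch in enumerate(union):
        if matches(branch, datum):
            return index
    raise ValueError("no branch matches %r" % (datum,))


class Fixed:
    """Wrapper that records which named fixed type a value belongs to."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


def pick_branch_wrapped(union, datum):
    """Resolve wrapped values by type name, and bare values by content."""
    if isinstance(datum, Fixed):
        for index, branch in enumerate(union):
            if branch["type"] == "fixed" and branch["name"] == datum.name:
                return index
        raise ValueError("no fixed branch named %r" % (datum.name,))
    return pick_branch(union, datum)


union = [
    {"type": "bytes"},
    {"type": "fixed", "name": "MD5", "size": 16},
]

digest = b"\x00" * 16  # intended as the fixed type MD5

# The bare value also satisfies the bytes branch, which comes first.
bare_choice = pick_branch(union, digest)

# The wrapped value names its type, so the fixed branch is selected.
wrapped_choice = pick_branch_wrapped(union, Fixed("MD5", digest))
```

Whether the bare result matters depends on the application, which is the point 
above: the data written is still valid, but the type the writer intended is 
lost.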

A primary question of this issue is whether to continue to permit multiple 
enum and fixed types in a union, distinguished by name.  No implementation takes 
advantage of this today, and dropping it, so that a union permits only a single 
enum and a single fixed, might make implementations simpler.  So far, no one has 
presented a use case for this feature.

I'd also like to see Ruby, Python and PHP improve their union handling by 
avoiding recursive validation.  If they add a name to each record instance, this 
is easy and better implements the spirit of the specification.  Adding 
wrappers for enum, fixed and bytes would also be good, but is a bigger change.
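
As a rough sketch of the named-record idea, assuming schemas are plain dicts 
with a short "name" and an optional "namespace": NamedRecord, schema_full_name 
and branch_by_name are hypothetical names, not part of any Avro implementation.  
The branch is chosen by comparing full names, so no field-by-field validation 
is needed and records with the same short name in different namespaces stay 
distinct.

```python
# Hypothetical sketch: illustrative names only, not an Avro API.

class NamedRecord(dict):
    """A record instance that remembers the full name of its schema."""

    def __init__(self, full_name, fields):
        super().__init__(fields)
        self.full_name = full_name


def schema_full_name(schema):
    """Join namespace and short name (assumes "name" holds a short name)."""
    namespace = schema.get("namespace")
    if namespace:
        return namespace + "." + schema["name"]
    return schema["name"]


def branch_by_name(union, record):
    """Pick the record branch whose full name equals the instance's name."""
    for index, branch in enumerate(union):
        if branch["type"] == "record" and schema_full_name(branch) == record.full_name:
            return index
    raise ValueError("no record branch named %r" % (record.full_name,))


point_fields = [{"name": "x", "type": "int"}]
record_union = [
    {"type": "record", "namespace": "org.a", "name": "Point", "fields": point_fields},
    {"type": "record", "namespace": "org.b", "name": "Point", "fields": point_fields},
]

# Both branches have identical fields, so validating content could not tell
# them apart; the name carried by the instance can.
point = NamedRecord("org.b.Point", {"x": 1})
point_choice = branch_by_name(record_union, point)
```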


> writing unions with multiple records, fixed or enums can choose wrong branch 
> -----------------------------------------------------------------------------
>
>                 Key: AVRO-656
>                 URL: https://issues.apache.org/jira/browse/AVRO-656
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.4.0
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-656.patch
>
>
> According to the specification, a union may contain multiple instances of a 
> named type, provided they have different names.  There are several bugs in 
> the Java implementation of this when writing data:
>  - for record, only the short-name of the record is checked, so the branch 
> for a record of the same name in a different namespace may be used by mistake
>  - for enum and fixed, the name of the type is not checked, so the first 
> enum or fixed in the union will always be assumed when writing.  In many 
> cases this may cause the wrong data to be written, potentially corrupting 
> output.
> This is not a regression.  This has never been implemented correctly by Java. 
>  Python and Ruby never check names, but rather perform a full, recursive 
> validation of content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
