[jira] Commented: (AVRO-656) writing unions with multiple records, fixed or enums can choose wrong branch

Scott Carey (JIRA) Fri, 03 Sep 2010 18:55:15 -0700

    [ 
https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906209#action_12906209
 ]


Scott Carey commented on AVRO-656:
----------------------------------

bq. Arguably we shouldn't worry so much. If an implementation can't distinguish 
between string and bytes then it should not be expected to preserve that 
distinction.

That would be a major change in what the Union is and what you can do with it.

For example, you might want a union of string and bytes, where the string is a 
hex representation of some data, and the bytes are raw data.  If the 
distinction can't be preserved, you can't use unions to store different 
representations of the same data.  What if one language does not differentiate 
between string and bytes because it implicitly assumes that strings are just 
UTF-8 byte arrays, while another language also cannot differentiate the two 
but assumes strings are little-endian UTF-16 byte arrays?  If Avro can't 
guarantee that a user can find out which branch of the union a piece of data 
came from, and doesn't allow specifying which branch to use when writing, then 
I think we've just blown away a lot of cross-language compatibility.

What if an implementation only has strings, and can't differentiate between 
strings and numerics without parsing the string?  I think it should be required 
to tag/flag the union field with its type and expose that to the user.  
In fact, I think all implementations should be expected to expose, one way or 
another, which Avro type the branch of a union field is.  We can't really be 
'magic' here and expect to achieve cross-language capabilities.


A user needs to be able to ask the implementation "which branch of the union 
is this union field?" and to specify "store this union field using branch X" 
when the language leaves it ambiguous.  An implementation might not require 
the user to specify the type being set, and default to the first matching 
type, but that choice should be up to the user.
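
As a rough illustration of that ask/specify contract, here is a minimal Python 
sketch.  UnionValue, resolve_branch and the predicate table are hypothetical 
names invented for this example, not part of any Avro library.  To mimic a 
language that cannot tell strings from bytes, the sketch deliberately lets both 
branches accept a Python bytes object.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence


@dataclass(frozen=True)
class UnionValue:
    """Hypothetical wrapper: a datum plus the union branch it belongs to."""
    branch: str
    datum: Any


# Assumption for illustration: this "language" represents both Avro string
# and Avro bytes as a bytes object, so the runtime type alone is ambiguous.
_MATCHES: Dict[str, Callable[[Any], bool]] = {
    "null": lambda d: d is None,
    "string": lambda d: isinstance(d, bytes),
    "bytes": lambda d: isinstance(d, bytes),
}


def resolve_branch(branches: Sequence[str], value: Any) -> int:
    """Return the index of the union branch to use when writing value."""
    if isinstance(value, UnionValue):
        # The user said "store this using branch X": honor it.
        if value.branch not in branches:
            raise ValueError(f"{value.branch!r} is not a branch of {list(branches)}")
        return list(branches).index(value.branch)
    # No tag given: fall back to the first branch that matches.
    for index, branch in enumerate(branches):
        if _MATCHES[branch](value):
            return index
    raise ValueError(f"no branch of {list(branches)} matches {value!r}")
```

With the schema ["string", "bytes"], an untagged b"cafe" falls back to branch 0, 
while UnionValue("bytes", b"\xca\xfe") is written with branch 1 because the 
user asked for it.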

bq. Implementations will read data into the highest fidelity representation 
they can, but an implementation that represents floats as doubles will not be 
able to always write exactly the data it reads when processing a [float,double] 
union.   

I think if a user wants to write exactly what was read, it should be possible.
So a language that uses doubles internally for both float and double would need 
to tag the union field it reads with what type it was when it was read and make 
that available, so that a user could make an informed decision on whether to 
serialize as a float or double.
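
A minimal sketch of that tagging for a [float, double] union, in Python, where 
every float is a double internally.  Tagged, read_union and write_union are 
hypothetical names for this example only.  The byte layout follows Avro's 
binary encoding (a zigzag-encoded branch index followed by a little-endian 
IEEE 754 value); the one-byte index shortcut is only valid because this union 
has just two branches.

```python
import struct
from typing import NamedTuple


class Tagged(NamedTuple):
    branch: str   # Avro type of the branch the value was read from
    value: float  # always a Python float (a double), whatever the branch


BRANCHES = ("float", "double")
FORMATS = {"float": "<f", "double": "<d"}  # little-endian IEEE 754


def write_union(tagged: Tagged) -> bytes:
    """Encode the branch index, then the value in that branch's width."""
    index = BRANCHES.index(tagged.branch)
    # Zigzag varint of a small non-negative index is the single byte index * 2.
    return bytes([index << 1]) + struct.pack(FORMATS[tagged.branch], tagged.value)


def read_union(buf: bytes) -> Tagged:
    """Decode a value and remember which branch it came from."""
    branch = BRANCHES[buf[0] >> 1]
    (value,) = struct.unpack_from(FORMATS[branch], buf, 1)
    return Tagged(branch, value)
```

Because the branch travels with the value, write_union(read_union(data)) 
reproduces the original bytes, and a user who wants to widen the value can 
still do so explicitly by re-tagging it as "double".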

bq.  Folks could be advised to order their unions to guard against this.

I think doing too much implicitly here will lead to trouble, especially since 
the number of possible combinations of things various languages might do when 
presented with ambiguity is large and may not be understood at the time a 
schema is defined.


Back to the original problem, I'm not sure I get it.  Records, Enums, and 
Fixed are named types.  If the type is named, why is it so hard to figure out 
which branch it belongs to?  If this means that an implementation can't use a 
string directly for an enum, but instead uses sentinel objects or a container 
with a value string and name string, isn't that OK?
If an implementation can't distinguish strings and bytes by type, shouldn't it 
track the branch some other way than by type?
If an implementation can't distinguish between bytes and fixed (like Java), it 
can wrap the fixed in a container and keep the name somewhere.

All implementations have at their disposal the ability to keep an additional 
internal value that tracks the union branch if it is ambiguous due to the 
language or otherwise.
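
For named types, the container idea could look roughly like the sketch below.  
FixedValue and branch_for_named are hypothetical names for this example, and 
the sketch assumes the union's named branches are available as a list of full 
names (namespace plus name).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FixedValue:
    """Hypothetical container: raw bytes plus the full name of their fixed type."""
    fullname: str  # namespace-qualified, e.g. "org.example.a.Hash"
    data: bytes


def branch_for_named(branch_fullnames: Sequence[str], value: FixedValue) -> int:
    """Pick the union branch by full name, not by short name or by content."""
    try:
        return list(branch_fullnames).index(value.fullname)
    except ValueError:
        raise TypeError(f"{value.fullname!r} is not a branch of this union") from None
```

Two fixed types that share a short name and size but live in different 
namespaces then resolve to different branches, since the comparison uses the 
full name carried by the container.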

Am I missing something?

> writing unions with multiple records, fixed or enums can choose wrong branch 
> -----------------------------------------------------------------------------
>
>                 Key: AVRO-656
>                 URL: https://issues.apache.org/jira/browse/AVRO-656
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.4.0
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-656.patch
>
>
> According to the specification, a union may contain multiple instances of a 
> named type, provided they have different names.  There are several bugs in 
> the Java implementation of this when writing data:
>  - for record, only the short-name of the record is checked, so the branch 
> for a record of the same name in a different namespace may be used by mistake
>  - for enum and fixed, the name of the record is not checked, so the first 
> enum or fixed in the union will always be assumed when writing.  in many 
> cases this may cause the wrong data to be written, potentially corrupting 
> output.
> This is not a regression.  This has never been implemented correctly by Java. 
>  Python and Ruby never check names, but rather perform a full, recursive 
> validation of content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
