[
https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906209#action_12906209
]
Scott Carey commented on AVRO-656:
----------------------------------
bq. Arguably we shouldn't worry so much. If an implementation can't distinguish
between string and bytes then it should not be expected to preserve that
distinction.
That would be a major change in what the Union is and what you can do with it.
For example, you might want a union of string and bytes, where the string is a
hex representation of some data, and the bytes are raw data. If the
distinction can't be preserved, you can't use unions to store different
representations of the same data. What if one language does not differentiate
between string and bytes because it implicitly assumes that strings are just
UTF-8 byte arrays, while another language likewise cannot differentiate the
two but assumes strings are little-endian UTF-16 byte arrays? If Avro can't
guarantee that a user can find out which branch of the union a piece of data
came from, and doesn't allow specifying which branch to use when it is written,
then I think we've just blown away a lot of cross-language compatibility.
What if an implementation only has strings, and can't differentiate between
strings and numerics without parsing the string? I think it should be required
to tag/flag the union field with what type it is and expose that to the user.
In fact, I think all implementations should be expected to expose, one way or
another, which Avro type the branch of a union field is. We can't really be
'magic' here and expect to achieve cross-language capabilities.
A user needs to be able to ask the implementation: "what branch of the union
is this union field" and specify "store this union field using branch X" when
there is ambiguity present in the language. An implementation might not
require the user to specify the type being set, and instead default to the
first matching type, but that choice should be up to the user.
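A minimal Python sketch of that contract follows. Nothing in it is an existing
Avro API: `UnionValue`, `MATCHERS` and `choose_branch` are hypothetical names,
and the sketch assumes an implementation that represents both Avro string and
Avro bytes as Python `bytes`, so the runtime type alone cannot pick a branch.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: this implementation stores both Avro "string" and Avro "bytes"
# as Python bytes, so both branches accept the same runtime type.
MATCHERS = {
    "string": lambda v: isinstance(v, bytes),
    "bytes": lambda v: isinstance(v, bytes),
}


@dataclass(frozen=True)
class UnionValue:
    """What a reader hands back: the datum plus the branch it was read from."""
    branch: str
    value: object


def choose_branch(union, value, branch: Optional[str] = None) -> int:
    """Return the index of the union branch to write `value` with."""
    if branch is not None:
        # The user said "store this union field using branch X".
        if branch not in union:
            raise ValueError(f"{branch!r} is not a branch of {union}")
        if not MATCHERS[branch](value):
            raise TypeError(f"value does not fit branch {branch!r}")
        return union.index(branch)
    # No branch given: default to the first matching type.
    for index, name in enumerate(union):
        if MATCHERS[name](value):
            return index
    raise TypeError(f"no branch of {union} accepts {value!r}")


union = ["string", "bytes"]
hex_text = UnionValue("string", b"cafe")      # hex representation of the data
raw_data = UnionValue("bytes", b"\xca\xfe")   # the raw data itself
```

Without the tag, `choose_branch(union, raw_data.value)` falls back to index 0
and the raw data would be written as a string; passing `raw_data.branch`
selects index 1, which is the choice the user actually made.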
bq. Implementations will read data into the highest fidelity representation
they can, but an implementation that represents floats as doubles will not be
able to always write exactly the data it reads when processing a [float,double]
union.
I think if a user wants to write exactly what was read, it should be possible.
So a language that uses doubles internally for both float and double would need
to tag the union field it reads with what type it was when it was read and make
that available, so that a user could make an informed decision on whether to
serialize as a float or double.
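To make the tagging idea concrete, here is a small Python sketch for a
["float", "double"] union. Python is a convenient stand-in because its `float`
is a double internally. `Tagged`, `read_union` and `write_union` are
hypothetical names, not part of any Avro library. The sketch relies on the
Avro binary encoding: a union is written as a zig-zag long holding the branch
index, followed by the value, with float as 4 and double as 8 little-endian
IEEE 754 bytes.

```python
import struct
from dataclasses import dataclass

UNION = ["float", "double"]
FORMATS = {"float": "<f", "double": "<d"}  # little-endian IEEE 754


@dataclass(frozen=True)
class Tagged:
    """A value plus the union branch it was read from."""
    branch: str
    value: float  # always a Python float, i.e. a double internally


def encode_long(n: int) -> bytes:
    """Avro long encoding: zig-zag, then variable-length base-128."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_union(buf: bytes) -> Tagged:
    # Simplification: a two-branch union always has a one-byte branch index.
    branch = UNION[buf[0] >> 1]
    (value,) = struct.unpack_from(FORMATS[branch], buf, 1)
    return Tagged(branch, value)


def write_union(datum: Tagged) -> bytes:
    index = encode_long(UNION.index(datum.branch))
    return index + struct.pack(FORMATS[datum.branch], datum.value)


original = write_union(Tagged("float", 0.1))
round_trip = read_union(original)
```

Because `round_trip.branch` is still "float", `write_union(round_trip)`
reproduces the original bytes. A user who prefers the wider type can write
`Tagged("double", round_trip.value)` instead, knowingly producing different
output.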
bq. Folks could be advised to order their unions to guard against this.
I think doing too much implicitly here will lead to trouble, especially since
the number of things various languages might do when presented with ambiguity
is large and may not be understood at the time a schema is defined.
Back to the original problem, I'm not sure I get it. Records, Enums, and
Fixed are named types. If the type is named, why is it so hard to figure out
what branch it belongs to? If this means that an implementation can't use a
string directly for an enum, but instead uses sentinel objects or a container
with a value string and name string, isn't that OK?
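One way to picture the sentinel-object approach is the Python sketch below.
`EnumSymbol` and `branch_for` are hypothetical names, and the two enum full
names are made up for the example. It assumes the writer knows the full name
(namespace plus name) of every named branch in the union.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnumSymbol:
    """Sentinel for an enum value: the symbol plus the enum's full name."""
    fullname: str  # namespace + name, e.g. "org.example.cards.Suit"
    symbol: str


def branch_for(union_fullnames, datum: EnumSymbol) -> int:
    """Pick the branch whose full name matches; ValueError if none does."""
    return union_fullnames.index(datum.fullname)


# Hypothetical union of two enums that share the short name "Suit".
union_fullnames = ["org.example.cards.Suit", "org.example.clothes.Suit"]
datum = EnumSymbol("org.example.clothes.Suit", "TWEED")
```

Here `branch_for(union_fullnames, datum)` returns 1. A bare string "TWEED"
would carry no name to match against, and comparing short names alone would
not separate the two branches.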
If an implementation can't distinguish strings and bytes by type, shouldn't it
track what branch it is some other way than the type?
If an implementation can't distinguish between bytes and fixed (like Java), it
can wrap the fixed in a container and keep the name somewhere.
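A rough Python sketch of such a container, under stated assumptions: `Fixed`,
`UNION`, `SIZES` and `branch_for` are hypothetical names, and the union of
bytes with a 16-byte fixed called "org.example.MD5" is invented for the
example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixed:
    """Container that keeps the fixed type's name next to its raw bytes."""
    fullname: str
    data: bytes


# Hypothetical union: plain bytes, or a 16-byte fixed named "org.example.MD5".
UNION = [("bytes", None), ("fixed", "org.example.MD5")]
SIZES = {"org.example.MD5": 16}


def branch_for(datum) -> int:
    """Unwrapped bytes go to "bytes"; a Fixed goes to the branch with its name."""
    if isinstance(datum, Fixed):
        size = SIZES.get(datum.fullname)
        if size is None:
            raise ValueError(f"unknown fixed type {datum.fullname!r}")
        if len(datum.data) != size:
            raise ValueError(f"{datum.fullname} expects {size} bytes")
        return UNION.index(("fixed", datum.fullname))
    if isinstance(datum, (bytes, bytearray)):
        return UNION.index(("bytes", None))
    raise TypeError(f"no branch for {type(datum).__name__}")
```

The same 16 bytes land in branch 0 when passed bare and in branch 1 when
wrapped in `Fixed("org.example.MD5", ...)`, so the wrapper carries the
distinction rather than the byte content.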
All implementations have at their disposal the ability to keep an additional
internal value that tracks the union branch if it is ambiguous due to the
language or otherwise.
Am I missing something?
> writing unions with multiple records, fixed or enums can choose wrong branch
> -----------------------------------------------------------------------------
>
> Key: AVRO-656
> URL: https://issues.apache.org/jira/browse/AVRO-656
> Project: Avro
> Issue Type: Bug
> Components: java
> Affects Versions: 1.4.0
> Reporter: Doug Cutting
> Assignee: Doug Cutting
> Attachments: AVRO-656.patch
>
>
> According to the specification, a union may contain multiple instances of a
> named type, provided they have different names. There are several bugs in
> the Java implementation of this when writing data:
> - for record, only the short-name of the record is checked, so the branch
> for a record of the same name in a different namespace may be used by mistake
> - for enum and fixed, the name of the record is not checked, so the first
> enum or fixed in the union will always be assumed when writing. In many
> cases this may cause the wrong data to be written, potentially corrupting
> output.
> This is not a regression. This has never been implemented correctly by Java.
> Python and Ruby never check names, but rather perform a full, recursive
> validation of content.