[
https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906075#action_12906075
]
Scott Carey commented on AVRO-656:
----------------------------------
OK, I'm going to review all my in-use schemas and see what the above options
would break.
First, there is the schema used to represent an arbitrary Pig field, which the
second alternative would break:
{code}
List<Schema> pigTypes = new ArrayList<Schema>();
pigTypes.add(Schema.create(Type.NULL));
pigTypes.add(Schema.create(Type.BOOLEAN));
pigTypes.add(Schema.create(Type.INT));
pigTypes.add(Schema.create(Type.LONG));
pigTypes.add(Schema.create(Type.FLOAT));
pigTypes.add(Schema.create(Type.DOUBLE));
pigTypes.add(Schema.create(Type.STRING));
pigTypes.add(Schema.create(Type.BYTES));
pigTypes.add(Schema.createArray(GENERIC_TUPLE));
// Tuple is a record containing a list of fields of type GENERIC_FIELD_UNION
pigTypes.add(GENERIC_TUPLE);
// Map is a map from String to GENERIC_FIELD_UNION
pigTypes.add(GENERIC_ELEMENT_MAP);
GENERIC_FIELD_UNION = Schema.createUnion(pigTypes);
{code}
I had tried to create a union with multiple fixed types and ran into issues
long ago. At the time I thought I was doing something wrong.
I have long since wrapped these in a record, which is how I have avoided this
bug:
{code}
[
{"name": "com.rr.avro.Fixed16", "type": "fixed", "size":16},
{"name": "com.rr.avro.Fixed4", "type": "fixed", "size":4},
{"name": "com.rr.avro.MyRecord", "type": "record", "fields": [
{"name": "hostIp", "type": ["Fixed4", "Fixed16"], "doc": "should always be 4 bytes (IPv4) or 16 bytes (IPv6)"},
... (more fields)
]}
]
{code}
I have some other unions like this that are important:
["Fixed16", "string", "null"]
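To make concrete why a union such as the hostIp field above is exposed to the bug in this issue, here is a minimal sketch in plain Java. It is a simplified model, not Avro's actual writer code: the class UnionBranchSketch, the Named type, and both resolve methods are hypothetical names introduced only for illustration. It contrasts picking the first branch of the right kind (the behaviour reported for enum and fixed) with matching on the full name.

```java
import java.util.Arrays;
import java.util.List;

public class UnionBranchSketch {
    /** Hypothetical stand-in for a named schema: its kind plus its full name. */
    public static final class Named {
        public final String kind;      // "fixed", "enum", or "record"
        public final String fullName;  // e.g. "com.rr.avro.Fixed16"

        public Named(String kind, String fullName) {
            this.kind = kind;
            this.fullName = fullName;
        }
    }

    /** Models the reported behaviour: the first branch of the same kind wins. */
    public static int resolveByKindOnly(List<Named> union, Named datum) {
        for (int i = 0; i < union.size(); i++) {
            if (union.get(i).kind.equals(datum.kind)) {
                return i;
            }
        }
        return -1; // no branch of that kind
    }

    /** Models the intended behaviour: kind and full name must both match. */
    public static int resolveByFullName(List<Named> union, Named datum) {
        for (int i = 0; i < union.size(); i++) {
            Named branch = union.get(i);
            if (branch.kind.equals(datum.kind)
                    && branch.fullName.equals(datum.fullName)) {
                return i;
            }
        }
        return -1; // no branch with that name
    }

    public static void main(String[] args) {
        // The hostIp union: ["Fixed4", "Fixed16"]
        List<Named> hostIp = Arrays.asList(
                new Named("fixed", "com.rr.avro.Fixed4"),
                new Named("fixed", "com.rr.avro.Fixed16"));
        Named ipv6 = new Named("fixed", "com.rr.avro.Fixed16");

        System.out.println(resolveByKindOnly(hostIp, ipv6)); // prints 0 (wrong branch)
        System.out.println(resolveByFullName(hostIp, ipv6)); // prints 1
    }
}
```

In this model a 16-byte value is tagged with the Fixed4 branch index whenever only the kind is compared, which is the sort of mismatch that could corrupt output.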
So in short, I think the first option makes sense from my use cases and the
second one is very restrictive.
It might make sense to simplify it and say that enum and/or fixed are not
allowed in UNION at all -- they must be wrapped in a named record. Limiting it
to only one of each might be somewhat useful, but would be more complicated.
Alternatively, making some or all of the unnamed types named might help too.
Making only one symbolic type allowed in a union is restrictive, especially
since I already have use cases for combining fixed, string, and bytes in a
union.
What about something like:
["BrowserTypeEnum", "string"] as a union. BrowserTypeEnum is a canonicalized
set of known browsers. If a user-agent string can't be bucketed into one of
the known types, its full string is stored instead. Sure, we could have a
record with an enum and a nullable string in it, but then you have a case
where it could be both types at once. The purpose of the Union is to guarantee
it's only one of the branches.
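The "only one of the branches" guarantee can be sketched in plain Java, without Avro, as a value type that holds either a known browser type or the raw user-agent string and never both. Everything here is an assumption made for illustration: the class names, the three enum symbols, and the substring-based bucketing rule are hypothetical and not taken from any real schema.

```java
import java.util.Locale;

public class BrowserUnionSketch {
    /** Hypothetical symbols; the real BrowserTypeEnum symbols are not given. */
    public enum BrowserType { FIREFOX, CHROME, SAFARI }

    /** Holds exactly one branch: a known type or the raw user-agent string. */
    public static final class BrowserValue {
        private final BrowserType known; // non-null only for the enum branch
        private final String raw;        // non-null only for the string branch

        private BrowserValue(BrowserType known, String raw) {
            this.known = known;
            this.raw = raw;
        }

        public static BrowserValue known(BrowserType type) {
            if (type == null) throw new IllegalArgumentException("type is null");
            return new BrowserValue(type, null);
        }

        public static BrowserValue raw(String userAgent) {
            if (userAgent == null) throw new IllegalArgumentException("userAgent is null");
            return new BrowserValue(null, userAgent);
        }

        public boolean isKnown() { return known != null; }
        public BrowserType knownType() { return known; }
        public String rawString() { return raw; }
    }

    /** Hypothetical bucketing rule, present only to make the sketch runnable. */
    public static BrowserValue bucket(String userAgent) {
        String ua = userAgent.toLowerCase(Locale.ROOT);
        if (ua.contains("firefox")) return BrowserValue.known(BrowserType.FIREFOX);
        if (ua.contains("chrome")) return BrowserValue.known(BrowserType.CHROME);
        if (ua.contains("safari")) return BrowserValue.known(BrowserType.SAFARI);
        // Unrecognized agents keep their full string instead of an enum symbol.
        return BrowserValue.raw(userAgent);
    }
}
```

Because the constructor is private and each factory fills exactly one field, a value with both an enum and a string cannot be built, which is the property a record with an enum and a nullable string would not enforce.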
> writing unions with multiple records, fixed or enums can choose wrong branch
> -----------------------------------------------------------------------------
>
> Key: AVRO-656
> URL: https://issues.apache.org/jira/browse/AVRO-656
> Project: Avro
> Issue Type: Bug
> Components: java
> Affects Versions: 1.4.0
> Reporter: Doug Cutting
> Assignee: Doug Cutting
> Attachments: AVRO-656.patch
>
>
> According to the specification, a union may contain multiple instances of a
> named type, provided they have different names. There are several bugs in
> the Java implementation of this when writing data:
> - for record, only the short-name of the record is checked, so the branch
> for a record of the same name in a different namespace may be used by mistake
> - for enum and fixed, the name of the record is not checked, so the first
> enum or fixed in the union will always be assumed when writing. In many
> cases this may cause the wrong data to be written, potentially corrupting
> output.
> This is not a regression. This has never been implemented correctly by Java.
> Python and Ruby never check names, but rather perform a full, recursive
> validation of content.