[ 
https://issues.apache.org/jira/browse/AVRO-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906075#action_12906075
 ] 

Scott Carey commented on AVRO-656:
----------------------------------

OK, I'm going to review all my in-use schemas and see what the above options 
would break.

First, there is the schema used to represent an arbitrary Pig field, which the 
second alternative would break:

{code}
    List<Schema> pigTypes = new ArrayList<Schema>();
    pigTypes.add(Schema.create(Type.NULL));
    pigTypes.add(Schema.create(Type.BOOLEAN));
    pigTypes.add(Schema.create(Type.INT));
    pigTypes.add(Schema.create(Type.LONG));
    pigTypes.add(Schema.create(Type.FLOAT));
    pigTypes.add(Schema.create(Type.DOUBLE));
    pigTypes.add(Schema.create(Type.STRING));
    pigTypes.add(Schema.create(Type.BYTES));
    pigTypes.add(Schema.createArray(GENERIC_TUPLE));
    // Tuple is a record containing a list of fields of type GENERIC_FIELD_UNION
    pigTypes.add(GENERIC_TUPLE);
    // Map is a map from String to GENERIC_FIELD_UNION
    pigTypes.add(GENERIC_ELEMENT_MAP);
    GENERIC_FIELD_UNION = Schema.createUnion(pigTypes);
{code}

I had tried to create a union with multiple fixed types and ran into issues 
long ago.  I actually thought I was doing something wrong.
I have long since wrapped these in a record, so I have avoided this bug:
{code}
[
{"name": "com.rr.avro.Fixed16", "type": "fixed", "size":16},
{"name": "com.rr.avro.Fixed4", "type": "fixed", "size":4},
{"name": "com.rr.avro.MyRecord", "type": "record", "fields": [
  {"name": "hostIp", "type": ["Fixed4", "Fixed16"],
   "doc": "should always be 4 bytes (IPv4) or 16 bytes (IPv6)"},
   ... (more fields)
  ]}
]
{code}

I have some other unions like this that are important:
["Fixed16", "string", "null"]
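
For unions like these, the writer has to pick the branch from the datum's 
full name rather than just its kind.  Here is a rough sketch of the difference, 
using plain Java only.  NamedBranch, UnionBranchSketch, resolveByKind and 
resolveByFullName are hypothetical names made up for illustration; none of 
this is Avro's actual code or API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a named union branch; this is not an Avro class.
final class NamedBranch {
    final String kind;     // "record", "enum" or "fixed"
    final String fullName; // namespace-qualified name, e.g. "com.rr.avro.Fixed16"

    NamedBranch(String kind, String fullName) {
        this.kind = kind;
        this.fullName = fullName;
    }
}

final class UnionBranchSketch {
    // First branch of the requested kind wins. This mirrors the enum/fixed
    // behaviour reported in the issue description, where names are not checked.
    static int resolveByKind(List<NamedBranch> union, String kind) {
        for (int i = 0; i < union.size(); i++) {
            if (union.get(i).kind.equals(kind)) return i;
        }
        return -1; // no branch of that kind
    }

    // Name-aware alternative: both the kind and the full name must match.
    static int resolveByFullName(List<NamedBranch> union, String kind, String fullName) {
        for (int i = 0; i < union.size(); i++) {
            NamedBranch b = union.get(i);
            if (b.kind.equals(kind) && b.fullName.equals(fullName)) return i;
        }
        return -1; // no matching branch
    }

    public static void main(String[] args) {
        List<NamedBranch> union = Arrays.asList(
            new NamedBranch("fixed", "com.rr.avro.Fixed4"),
            new NamedBranch("fixed", "com.rr.avro.Fixed16"));
        // A Fixed16 datum resolved by kind alone lands on the Fixed4 branch.
        System.out.println(resolveByKind(union, "fixed"));
        // Resolved by full name, it lands on the Fixed16 branch.
        System.out.println(resolveByFullName(union, "fixed", "com.rr.avro.Fixed16"));
    }
}
```

With a ["Fixed4", "Fixed16"] union, the kind-only lookup returns index 0 for 
every fixed datum, while the name-aware lookup returns index 1 for a Fixed16.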


So in short, I think the first option makes sense from my use cases and the 
second one is very restrictive.  
It might make sense to simplify it and say that enum and/or fixed are not 
allowed in a union at all -- they must be wrapped in a named record.  Limiting 
it to only one of each might be somewhat useful, but would be more complicated.  

Alternatively, making some or all of the unnamed types named might help too.

Allowing only one symbolic type in a union is restrictive, especially 
since I already have use cases for combining fixed, string, and bytes in a 
union.

What about something like:
["BrowserTypeEnum", "string"] as a union.  BrowserTypeEnum is a canonicalized 
set of known browsers.  If a user-agent string can't be bucketed into one of 
the known types, its full string is stored instead.  Sure, we could have a 
record with an enum and a nullable string in it, but now you have a case 
where it could be both types at once.  The purpose of the union is to 
guarantee it's only one of the branches.
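
To make the record-versus-union point concrete, here is a small plain-Java 
sketch.  BrowserType and its symbols, BrowserUnion and BrowserRecord are all 
hypothetical names for illustration, not Avro-generated classes:

```java
import java.util.Objects;

// Hypothetical set of known browsers; the real enum symbols are not given here.
enum BrowserType { FIREFOX, CHROME, SAFARI, OTHER_KNOWN }

// Union-style value: the factories guarantee exactly one branch is set.
final class BrowserUnion {
    private final BrowserType known;   // non-null only for the enum branch
    private final String rawUserAgent; // non-null only for the string branch

    private BrowserUnion(BrowserType known, String rawUserAgent) {
        this.known = known;
        this.rawUserAgent = rawUserAgent;
    }

    static BrowserUnion ofKnown(BrowserType type) {
        return new BrowserUnion(Objects.requireNonNull(type), null);
    }

    static BrowserUnion ofRaw(String userAgent) {
        return new BrowserUnion(null, Objects.requireNonNull(userAgent));
    }

    boolean isKnown() { return known != null; }
    BrowserType known() { return known; }
    String rawUserAgent() { return rawUserAgent; }
}

// Record-style alternative: an enum plus a nullable string. Nothing prevents
// both fields from being set at once, or neither.
final class BrowserRecord {
    final BrowserType known;   // may be null
    final String rawUserAgent; // may be null

    BrowserRecord(BrowserType known, String rawUserAgent) {
        this.known = known;
        this.rawUserAgent = rawUserAgent;
    }

    // True when the record holds both values or neither, a state the
    // union-style type cannot represent.
    boolean isAmbiguous() {
        return (known != null) == (rawUserAgent != null);
    }
}
```

With the record, every reader has to decide what "both set" or "neither set" 
means; with the union, that state cannot be written in the first place.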

> writing unions with multiple records, fixed or enums can choose wrong branch 
> -----------------------------------------------------------------------------
>
>                 Key: AVRO-656
>                 URL: https://issues.apache.org/jira/browse/AVRO-656
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.4.0
>            Reporter: Doug Cutting
>            Assignee: Doug Cutting
>         Attachments: AVRO-656.patch
>
>
> According to the specification, a union may contain multiple instances of a 
> named type, provided they have different names.  There are several bugs in 
> the Java implementation of this when writing data:
>  - for record, only the short-name of the record is checked, so the branch 
> for a record of the same name in a different namespace may be used by mistake
>  - for enum and fixed, the name of the record is not checked, so the first 
> enum or fixed in the union will always be assumed when writing.  in many 
> cases this may cause the wrong data to be written, potentially corrupting 
> output.
> This is not a regression.  This has never been implemented correctly by Java. 
>  Python and Ruby never check names, but rather perform a full, recursive 
> validation of content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
