[ 
https://issues.apache.org/jira/browse/AVRO-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203256#comment-13203256
 ] 

Scott Carey commented on AVRO-1006:
-----------------------------------

{quote}
While representing the canonical schema as Avro data reduces it (compared to 
Json representation) it does not eliminate ambiguity. Non-empty arrays (and 
maps) can be represented in Avro in more than one way.
{quote}

Can you provide an example?  I am having trouble thinking of one that doesn't 
fall under the other disambiguations in the document, e.g. 
{"type":"int"} == "int".  Can we have an Avro-serialized canonical form without 
the ambiguity?  The document describes two forms of normalization: the Avro 
normalization and the JSON normalization.  I don't think that something that 
has undergone the Avro normalization can have an ambiguous array definition.  
If it could, that would break both the JSON string fingerprint and the Avro 
binary fingerprint.  I am suggesting we avoid the JSON normalization, and its 
dependency on JSON serializers that support ordering, by using an Avro binary 
representation as the input to the hash.  Both cases require the Avro 
normalization component.
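
One possible reading of the quoted claim (an assumption here, not necessarily 
what was meant) is the block structure of Avro's binary encoding: an array is 
written as a series of blocks, each prefixed by an item count, and terminated 
by a zero count.  The same array value can therefore be split into blocks in 
more than one way.  The sketch below hand-encodes an int array both ways; the 
class and method names are hypothetical and it does not use the Avro library.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hand-rolled encoder/decoder, not the Avro library.
public class ArrayBlockEncoding {

    // Write a value as a zig-zag variable-length integer (used for int and long).
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Read a zig-zag variable-length integer; pos[0] is the read cursor.
    static long readLong(byte[] data, int[] pos) {
        long z = 0;
        int shift = 0;
        int b;
        do {
            b = data[pos[0]++] & 0xFF;
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);
    }

    // Encode an int array, splitting it into blocks of at most blockSize items.
    static byte[] encode(int[] items, int blockSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int start = 0; start < items.length; start += blockSize) {
            int count = Math.min(blockSize, items.length - start);
            writeLong(out, count);
            for (int i = start; i < start + count; i++) {
                writeLong(out, items[i]);
            }
        }
        writeLong(out, 0); // a zero count terminates the array
        return out.toByteArray();
    }

    // Decode an int array regardless of how it was split into blocks.
    static List<Integer> decode(byte[] data) {
        int[] pos = {0};
        List<Integer> result = new ArrayList<>();
        long count;
        while ((count = readLong(data, pos)) != 0) {
            if (count < 0) {
                count = -count;
                readLong(data, pos); // negative count: a byte size follows, skip it
            }
            for (long i = 0; i < count; i++) {
                result.add((int) readLong(data, pos));
            }
        }
        return result;
    }
}
```

Here encode(new int[]{1, 2}, 2) and encode(new int[]{1, 2}, 1) yield different 
bytes that decode to the same array, so a binary canonical form would have to 
pin down the blocking (for instance, always a single block) before hashing.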

It might be useful to ALWAYS store schemas in memory in normalized form, with 
attributes and doc represented and attached separately.  The PRIMITIVES, 
FULLNAME, and MINIMIZE optimizations can always be applied, and the STRIP 
optimization becomes trivial with a shared canonical schema.  For example, a 
Schema can have a CanonicalSchema member variable, plus attributes and doc.  
Two Schema.Fixed instances that vary only in their doc or attributes would 
share the same CanonicalSchema.  Is it ever useful to know that two schemas 
differ only due to FULLNAME, MINIMIZE, or PRIMITIVES expansion?
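
A minimal sketch of that idea, assuming a fixed schema is fully described by 
its full name and size.  None of these classes exist in Avro; CanonicalFixed, 
FixedSchema, and the interning pool are hypothetical names for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative sketch only: these types are hypothetical, not Avro's API.
public class SharedCanonicalSketch {

    // The normalized part of a fixed schema: full name and size, nothing else.
    static final class CanonicalFixed {
        final String fullName;
        final int size;

        CanonicalFixed(String fullName, int size) {
            this.fullName = fullName;
            this.size = size;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CanonicalFixed)) return false;
            CanonicalFixed other = (CanonicalFixed) o;
            return size == other.size && fullName.equals(other.fullName);
        }

        @Override
        public int hashCode() {
            return Objects.hash(fullName, size);
        }
    }

    // A schema is its shared canonical part plus doc and attributes kept apart.
    static final class FixedSchema {
        final CanonicalFixed canonical;
        final String doc;
        final Map<String, String> attributes;

        FixedSchema(CanonicalFixed canonical, String doc,
                    Map<String, String> attributes) {
            this.canonical = canonical;
            this.doc = doc;
            this.attributes = attributes;
        }
    }

    // Interning pool so equal canonical schemas are one instance.
    // Not thread-safe; a real implementation would need to address that.
    private static final Map<CanonicalFixed, CanonicalFixed> POOL = new HashMap<>();

    static FixedSchema fixed(String fullName, int size, String doc,
                             Map<String, String> attributes) {
        CanonicalFixed shared =
            POOL.computeIfAbsent(new CanonicalFixed(fullName, size), k -> k);
        return new FixedSchema(shared, doc, attributes);
    }
}
```

With this layout, two fixed schemas that differ only in doc or attributes hold 
the same CanonicalFixed instance, so STRIP amounts to reading the canonical 
field and a fingerprint could be computed once per canonical instance.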
                
> Fingerprints for Avro Schemas
> -----------------------------
>
>                 Key: AVRO-1006
>                 URL: https://issues.apache.org/jira/browse/AVRO-1006
>             Project: Avro
>          Issue Type: New Feature
>          Components: java
>            Reporter: Raymie Stata
>            Assignee: Raymie Stata
>              Labels: features
>         Attachments: schema-fingerprinting.html, schema-fingerprinting.html, 
> schema-fingerprinting.html
>
>
> Add a function that returns a standardized, 64-bit fingerprint for schemas.  
> Fingerprints are designed so that the chance of a collision is very, very 
> low.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
