[ https://issues.apache.org/jira/browse/AVRO-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203256#comment-13203256 ]
Scott Carey commented on AVRO-1006:
-----------------------------------

{quote}
While representing the canonical schema as Avro data reduces it (compared to Json representation) it does not eliminate ambiguity. Non-empty arrays (and maps) can be represented in Avro in more than one way.
{quote}

Can you provide an example? I am having trouble thinking of one that doesn't fall under the other disambiguations in the document, e.g. {"type":"int"} == "int". Can we have an Avro-serialized canonical form without the ambiguity?

The document describes two forms of normalization: the Avro normalization and the JSON normalization. I don't think that something that has undergone the Avro normalization can have an ambiguous array definition. If it could, that would break both the JSON string fingerprint case and the Avro binary fingerprint.

I am suggesting that we avoid the JSON normalization, and its dependency on JSON serializers that support ordering, by using an Avro binary representation as the input to the hash. Both cases require the Avro normalization component.

It might be useful to ALWAYS store schemas in memory in normalized form, with attributes and doc represented and attached separately. The PRIMITIVES, FULLNAME, and MINIMIZE optimizations can always be applied, and the STRIP optimization becomes trivial with a shared canonical schema. For example, a Schema can have a CanonicalSchema member variable, plus attributes and doc. Two Schema.Fixed instances that vary only in their doc or attributes would share the same CanonicalSchema.

Is it ever useful to know that two schemas differ only due to FULLNAME, MINIMIZE, or PRIMITIVES expansion?
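To make the shared-canonical-schema idea above concrete, here is a minimal sketch. It is not Avro's actual Schema class: the names SharedCanonicalSketch, CanonicalFixed, FixedSchema, and sameCanonicalForm are all hypothetical, and the only assumption taken from the comment is that doc and attributes live outside the canonical part.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch only; none of these types exist in Avro.
class SharedCanonicalSketch {

    /** Assumed canonical part of a fixed schema: full name and size, nothing else. */
    static final class CanonicalFixed {
        final String fullName;
        final int size;

        CanonicalFixed(String fullName, int size) {
            this.fullName = fullName;
            this.size = size;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CanonicalFixed)) return false;
            CanonicalFixed other = (CanonicalFixed) o;
            return size == other.size && fullName.equals(other.fullName);
        }

        @Override
        public int hashCode() {
            return 31 * fullName.hashCode() + size;
        }
    }

    /** Assumed wrapper: a canonical member plus doc and attributes held separately. */
    static final class FixedSchema {
        final CanonicalFixed canonical;
        final String doc;
        final Map<String, String> attributes;

        FixedSchema(CanonicalFixed canonical, String doc, Map<String, String> attributes) {
            this.canonical = canonical;
            this.doc = doc;
            this.attributes = attributes;
        }

        /** STRIP becomes a field access: doc and attributes are never consulted. */
        boolean sameCanonicalForm(FixedSchema other) {
            return canonical.equals(other.canonical);
        }
    }

    public static void main(String[] args) {
        CanonicalFixed md5 = new CanonicalFixed("org.example.MD5", 16);
        FixedSchema a = new FixedSchema(md5, "An MD5 digest",
                Collections.<String, String>emptyMap());
        FixedSchema b = new FixedSchema(md5, null,
                Collections.singletonMap("owner", "search"));
        // The two schemas differ only in doc and attributes.
        System.out.println(a.sameCanonicalForm(b));
    }
}
```

Under this layout a fingerprint would be computed from the canonical member alone, so two schemas that differ only in doc or attributes would hash identically by construction.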
> Fingerprints for Avro Schemas
> -----------------------------
>
> Key: AVRO-1006
> URL: https://issues.apache.org/jira/browse/AVRO-1006
> Project: Avro
> Issue Type: New Feature
> Components: java
> Reporter: Raymie Stata
> Assignee: Raymie Stata
> Labels: features
> Attachments: schema-fingerprinting.html, schema-fingerprinting.html, schema-fingerprinting.html
>
>
> Add function that returns a standardized, 64-bit fingerprint for schemas.
> Fingerprints are designed such that the chances of collisions is very, very low.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira