Dominik Mautz created AVRO-4004:
-----------------------------------

             Summary: [Rust] Canonical form transformation does not strip the logicalType
                 Key: AVRO-4004
                 URL: https://issues.apache.org/jira/browse/AVRO-4004
             Project: Apache Avro
          Issue Type: Bug
          Components: rust
            Reporter: Dominik Mautz

The Rust implementation of the canonical form transformation does not strip the _logicalType_ attribute as required by the [STRIP] rule ([https://avro.apache.org/docs/1.11.0/spec.html#Transforming+into+Parsing+Canonical+Form]). As a result, the same schema yields a different fingerprint in Rust than in other implementations (at least Python and Java). This can become an issue, for instance, for kafka-delta-ingest ([https://github.com/delta-io/kafka-delta-ingest]).

Rust
{code:java}
[package]
name = "avro-issue"
version = "0.2.0"
edition = "2018"

[dependencies]
apache-avro = "0.16.0"
anyhow = "1.0.86"
{code}
{code:java}
use anyhow::Result;
use apache_avro::{rabin::Rabin, Schema};

fn main() -> Result<()> {
    // Field "c" carries a logicalType that the canonical form should strip.
    let schema_str = r#"
    {
        "type": "record",
        "name": "test",
        "fields": [
            {"name": "a", "type": "long", "default": 42, "doc": "The field a"},
            {"name": "b", "type": "string", "namespace": "test.a"},
            {"name": "c", "type": "long", "logicalType": "timestamp-micros"}
        ]
    }"#;

    let schema = Schema::parse_str(schema_str)?;

    // Parsing Canonical Form and its Rabin fingerprint.
    let canonical_form = schema.canonical_form();
    let fp_rabin = schema.fingerprint::<Rabin>();

    println!("Canonical form: {}", canonical_form);
    println!("Rabin fingerprint: {}", fp_rabin);
    Ok(())
}
{code}
Output:
{code:java}
Canonical form: {"name":"test","type":"record","fields":[{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"c","type":{"type":"long","logicalType":"timestamp-micros"}}]}
Rabin fingerprint: 28cf0a67d9937bb3
{code}
As you can see, the _logicalType_ is still present in the "canonical form".

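For reference, the [STRIP] rule in the spec keeps only the attributes type, name, fields, symbols, items, values and size, and drops everything else. The following is a minimal sketch of that behaviour in plain Python (standard library only), not the code of any Avro implementation. The helper names {{strip_and_order}} and {{sketch_canonical_form}} are made up for illustration. The sketch applies only [STRIP], [ORDER] and whitespace removal, and skips the remaining rules ([PRIMITIVES], [FULLNAMES], [INTEGERS], [STRINGS]), which happen not to matter for the schema in this report.

```python
import json

# Attributes kept by the [STRIP] rule, listed in the order required by [ORDER].
KEPT_ATTRIBUTES = ("name", "type", "fields", "symbols", "items", "values", "size")


def strip_and_order(node):
    """Recursively apply only the [STRIP] and [ORDER] rules to parsed schema JSON."""
    if isinstance(node, dict):
        # Keep only the allowed attributes, emitted in canonical order.
        return {key: strip_and_order(node[key]) for key in KEPT_ATTRIBUTES if key in node}
    if isinstance(node, list):
        return [strip_and_order(item) for item in node]
    # Primitive type names and other scalars pass through unchanged.
    return node


def sketch_canonical_form(schema_str):
    """Hypothetical helper: compact JSON after STRIP and ORDER (not a full implementation)."""
    stripped = strip_and_order(json.loads(schema_str))
    return json.dumps(stripped, separators=(",", ":"))


SCHEMA_STR = """
{
    "type": "record",
    "name": "test",
    "fields": [
        {"name": "a", "type": "long", "default": 42, "doc": "The field a"},
        {"name": "b", "type": "string", "namespace": "test.a"},
        {"name": "c", "type": "long", "logicalType": "timestamp-micros"}
    ]
}"""

if __name__ == "__main__":
    print(sketch_canonical_form(SCHEMA_STR))
```

With the attribute list restricted this way, _logicalType_, _default_, _doc_ and the field-level _namespace_ all disappear from the result, which is the behaviour expected from the Rust implementation as well.
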
Python
{code:python}
import avro.schema

schema_str = """
{
    "type": "record",
    "name": "test",
    "fields": [
        {"name": "a", "type": "long", "default": 42, "doc": "The field a"},
        {"name": "b", "type": "string", "namespace": "test.a"},
        {"name": "c", "type": "long", "logicalType": "timestamp-micros"}
    ]
}"""

schema = avro.schema.parse(schema_str)

print(f"Canonical form: {schema.canonical_form}")
print(f"Rabin fingerprint: {schema.fingerprint().hex()}")
{code}
Output:
{code:java}
Canonical form: {"name":"test","type":"record","fields":[{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"c","type":"long"}]}
Rabin fingerprint: 385501e341b00a1c
{code}
Java returns the same output as Python.

IMHO, changing the line [https://github.com/apache/avro/blob/main/lang/rust/avro/src/schema.rs#L2159] to
{code:java}
//...
if field_ordering_position(k).is_none()
    || k == "default"
    || k == "doc"
    || k == "aliases"
    || k == "logicalType"
{
//...
{code}
should resolve the issue. However, I am unsure whether this line should strip even more attributes than the ones currently listed explicitly.

In any case, the test in [https://github.com/apache/avro/blob/fdab5db0816e28e3e10c87910c8b6f98c33072dc/lang/rust/avro/src/schema.rs#L3388] must also be adapted to reflect the correct canonical form and the corresponding fingerprint.

Rabin: 385501e341b00a1c
MD5: 384f46367ef8c22dbbf44109b82ff7aa
SHA-256: 8e72f58f2d84a59d6a08e8db5fdc6484dee35babf33179cea72889ae63083f36

--
This message was sent by Atlassian Jira
(v8.20.10#820010)