Dominik Mautz created AVRO-4004:
-----------------------------------
Summary: [Rust] Canonical form transformation does not strip the
logicalType
Key: AVRO-4004
URL: https://issues.apache.org/jira/browse/AVRO-4004
Project: Apache Avro
Issue Type: Bug
Components: rust
Reporter: Dominik Mautz
The Rust implementation of the canonical form transformation does not strip the
_logicalType_ attribute as required by the [STRIP] rule
([https://avro.apache.org/docs/1.11.0/spec.html#Transforming+into+Parsing+Canonical+Form]).
As a result, the same schema produces a different fingerprint in Rust than in
other implementations (at least Python and Java).
This can become an issue, for instance, for kafka-delta-ingest
([https://github.com/delta-io/kafka-delta-ingest]).
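For reference, the [STRIP] rule keeps only the attributes type, name, fields, symbols, items, values and size. The following is a minimal sketch of that rule in plain Python, not the code of any Avro library. It also applies the [PRIMITIVES] and [ORDER] rules in simplified form, and it deliberately skips [FULLNAMES] and the other rules, which do not matter for the schema in this report. The names KEPT_IN_ORDER, strip and canonical_sketch are made up for illustration.

```python
import json

# Attributes kept by [STRIP], listed in the order required by [ORDER].
KEPT_IN_ORDER = ("name", "type", "fields", "symbols", "items", "values", "size")
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def strip(node):
    """Simplified STRIP + ORDER + PRIMITIVES pass (hypothetical helper)."""
    if isinstance(node, list):
        return [strip(item) for item in node]
    if isinstance(node, dict):
        # Drop everything that is not in the keep list (doc, default, logicalType, ...).
        kept = {key: strip(node[key]) for key in KEPT_IN_ORDER if key in node}
        # [PRIMITIVES]: {"type": "long"} collapses to the bare string "long".
        only_type = set(kept) == {"type"}
        if only_type and isinstance(kept["type"], str) and kept["type"] in PRIMITIVES:
            return kept["type"]
        return kept
    return node


def canonical_sketch(schema_str):
    """Serialize the stripped schema without whitespace."""
    return json.dumps(strip(json.loads(schema_str)), separators=(",", ":"))


schema_str = """
{
  "type": "record",
  "name": "test",
  "fields": [
    {"name": "a", "type": "long", "default": 42, "doc": "The field a"},
    {"name": "b", "type": "string", "namespace": "test.a"},
    {"name": "c", "type": "long", "logicalType": "timestamp-micros"}
  ]
}"""

print(canonical_sketch(schema_str))
```

Under these assumptions, no _logicalType_ key survives, whether it is written next to the field's type or inside a nested type object.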
Rust
{code:java}
[package]
name = "avro-issue"
version = "0.2.0"
edition = "2018"
[dependencies]
apache-avro = "0.16.0"
anyhow = "1.0.86"
{code}
{code:java}
use anyhow::Result;
use apache_avro::{rabin::Rabin, Schema};

fn main() -> Result<()> {
    // Field "c" carries a logicalType that the canonical form should drop.
    let schema_str = r#"
    {
        "type": "record",
        "name": "test",
        "fields": [
            {"name": "a", "type": "long", "default": 42, "doc": "The field a"},
            {"name": "b", "type": "string", "namespace": "test.a"},
            {"name": "c", "type": "long", "logicalType": "timestamp-micros"}
        ]
    }"#;
    let schema = Schema::parse_str(schema_str)?;
    let canonical_form = schema.canonical_form();
    let fp_rabin = schema.fingerprint::<Rabin>();
    println!("Canonical form: {}", canonical_form);
    println!("Rabin fingerprint: {}", fp_rabin);
    Ok(())
}
{code}
Output:
{code:java}
Canonical form:
{"name":"test","type":"record","fields":[{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"c","type":{"type":"long","logicalType":"timestamp-micros"}}]}
Rabin fingerprint: 28cf0a67d9937bb3
{code}
As you can see, the _logicalType_ is still present in the "canonical form."
Python
{code:python}
import avro.schema
schema_str = """
{
"type": "record",
"name": "test",
"fields": [
{"name": "a", "type": "long", "default": 42, "doc": "The field a"},
{"name": "b", "type": "string", "namespace": "test.a"},
{"name": "c", "type": "long", "logicalType": "timestamp-micros"}
]
}"""
schema = avro.schema.parse(schema_str)
print(f"Canonical form: {schema.canonical_form}")
print(f"Rabin fingerprint: {schema.fingerprint().hex()}")
{code}
Output:
{code:java}
Canonical form:
{"name":"test","type":"record","fields":[{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"c","type":"long"}]}
Rabin fingerprint: 385501e341b00a1c
{code}
Java returns the same output as Python.
I think that changing the line
[https://github.com/apache/avro/blob/main/lang/rust/avro/src/schema.rs#L2159]
to
{code:java}
//...
if field_ordering_position(k).is_none() || k == "default" || k == "doc"
    || k == "aliases" || k == "logicalType" {
//...
{code}
should resolve the issue. However, I am unsure whether this line should
include even more attributes (beyond those currently listed explicitly).
In any case, the test in
[https://github.com/apache/avro/blob/fdab5db0816e28e3e10c87910c8b6f98c33072dc/lang/rust/avro/src/schema.rs#L3388]
must also be adapted to reflect the correct canonical form and the
corresponding fingerprints:
Rabin: 385501e341b00a1c
MD5: 384f46367ef8c22dbbf44109b82ff7aa
SHA-256: 8e72f58f2d84a59d6a08e8db5fdc6484dee35babf33179cea72889ae63083f36
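To cross-check the Rabin values without depending on any Avro library, the following is a minimal Python sketch of the CRC-64-AVRO fingerprint, transcribed from the pseudo-code in the Avro specification. The helper names fingerprint64 and rabin_hex are made up for illustration. Printing the 64-bit value as little-endian bytes is an assumption about how the libraries render the hex string, so compare the printed values against the ones reported in this issue before relying on them.

```python
# Fingerprint of the empty input, as defined in the Avro specification.
EMPTY = 0xC15D213AA4D7A795


def _build_table():
    """Precompute the 256-entry lookup table used by CRC-64-AVRO."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            # Equivalent to (fp >>> 1) ^ (EMPTY & -(fp & 1)) in the spec.
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table


FP_TABLE = _build_table()


def fingerprint64(data: bytes) -> int:
    """Return the CRC-64-AVRO fingerprint as an unsigned 64-bit integer."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ FP_TABLE[(fp ^ byte) & 0xFF]
    return fp


def rabin_hex(canonical_form: str) -> str:
    # Assumption: the hex string is the fingerprint in little-endian byte order.
    return fingerprint64(canonical_form.encode("utf-8")).to_bytes(8, "little").hex()


# The two canonical forms quoted in this report.
stripped = '{"name":"test","type":"record","fields":[{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"c","type":"long"}]}'
unstripped = '{"name":"test","type":"record","fields":[{"name":"a","type":"long"},{"name":"b","type":"string"},{"name":"c","type":{"type":"long","logicalType":"timestamp-micros"}}]}'

print("logicalType stripped:", rabin_hex(stripped))
print("logicalType kept:    ", rabin_hex(unstripped))
```

Because the fingerprint is computed over the canonical form text, any attribute left in that text changes the result, which is why the two implementations disagree.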