On Tue, Apr 7, 2020 at 4:03 AM roger peppe <rogpe...@gmail.com> wrote:
> On the one hand the specification says
> <https://avro.apache.org/docs/1.9.1/spec.html#Parsing+Canonical+Form+for+Schemas>
> :
>
>> If the Parsing Canonical Forms of two different schemas are textually
>> equal, then those schemas are "the same" as far as any reader is concerned

This statement in the specification could perhaps be improved. What it means is that low-level parsing errors will not be encountered when using two such schemas. It does not mean they're equivalent for all purposes.

> but on the other, when discussing the decimal logical type, it says:
>
>> For the purposes of schema resolution, two schemas that are decimal logical
>> types *match* if their scales and precisions match.

Schema resolution involves a different kind of equivalence for schemas. Two compatible schemas here may have quite different binary formats, fields might be reordered, removed, or added. Scalar types may be promoted, etc.

> Given that the spec recommends using the canonical form for schema
> fingerprints, ISTM there might be some possibility for attack (or at least
> data corruption) there - if we unwittingly read a decimal value that was
> written with a different scale, we could read numbers thinking they're a
> different order of magnitude than they actually are.

Identical Parsing Canonical Form only tells you whether you can parse the data, not whether you can resolve it. Indeed, if you use a different logical type definition but only check parsing-level compatibility then you can get incorrect data.

There is a proposal to add an alternate canonical form that incorporates logical types:

https://github.com/apache/avro/pull/805
https://issues.apache.org/jira/browse/AVRO-2299

Does this look like what you'd like? It seems that patch has been ignored, but perhaps we can pick it up again and get it committed.

Thanks,

Doug
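
To make the scale problem concrete, here is a rough Python sketch. It does not use the Avro library: `strip_schema`, `canonical_form`, `fingerprint`, and `decode_unscaled` are hypothetical helpers, and the canonicalizer is a simplification that only keeps the attributes the spec lists for Parsing Canonical Form (it skips fullname resolution and the string and integer normalisation steps). SHA-256 is used as the fingerprint on the assumption that any digest of the canonical text shows the same effect.

```python
import hashlib
import json
from decimal import Decimal

# Attributes retained by this simplified canonicalizer, in output order.
# Everything else (doc, aliases, logicalType, precision, scale, ...) is dropped.
KEPT = ("name", "type", "fields", "symbols", "items", "values", "size")
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def strip_schema(schema):
    """Recursively drop attributes that are not part of the canonical form."""
    if isinstance(schema, dict):
        stripped = {k: strip_schema(schema[k]) for k in KEPT if k in schema}
        # Collapse {"type": "bytes"} to the bare primitive name "bytes".
        if list(stripped) == ["type"] and stripped["type"] in PRIMITIVES:
            return stripped["type"]
        return stripped
    if isinstance(schema, list):
        return [strip_schema(s) for s in schema]
    return schema


def canonical_form(schema):
    """Serialize the stripped schema as JSON with no extra whitespace."""
    return json.dumps(strip_schema(schema), separators=(",", ":"))


def fingerprint(schema):
    """Hash the canonical text; two schemas with equal text collide by design."""
    return hashlib.sha256(canonical_form(schema).encode("utf-8")).hexdigest()


def decode_unscaled(unscaled, schema):
    """Interpret an unscaled integer using the scale declared by the schema."""
    return Decimal(unscaled).scaleb(-schema["scale"])


writer = {"type": "bytes", "logicalType": "decimal", "precision": 9, "scale": 2}
reader = {"type": "bytes", "logicalType": "decimal", "precision": 9, "scale": 4}

# Same canonical text and same fingerprint, so a fingerprint check passes.
print(canonical_form(writer), canonical_form(reader))
print(fingerprint(writer) == fingerprint(reader))

# The same stored integer is read as two different magnitudes.
print(decode_unscaled(12345, writer), decode_unscaled(12345, reader))
```

Both schemas reduce to the bare `"bytes"` type here, so a reader that only compares fingerprints has no way to notice that the scale changed.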