On Tue, Apr 7, 2020 at 4:03 AM roger peppe <rogpe...@gmail.com> wrote:

> On the one hand the specification says
> <https://avro.apache.org/docs/1.9.1/spec.html#Parsing+Canonical+Form+for+Schemas>
> :
>
>> If the Parsing Canonical Forms of two different schemas are textually
>> equal, then those schemas are "the same" as far as any reader is concerned
>
>
This statement in the specification could perhaps be improved.  What it
means is that low-level parsing errors will not be encountered when using
two such schemas.  It does not mean they're equivalent for all purposes.


> but on the other, when discussing the decimal logical type, it says:
>
>> For the purposes of schema resolution, two schemas that are decimal logical
>> types *match* if their scales and precisions match.
>
>
>
Schema resolution involves a different kind of equivalence for schemas.
Two compatible schemas here may have quite different binary formats: fields
might be reordered, removed, or added, scalar types may be promoted, etc.


> Given that the spec recommends using the canonical form for schema
> fingerprints, ISTM there might be some possibility for attack (or at least
> data corruption) there - if we unwittingly read a decimal value that was
> written with a different scale, we could read numbers thinking they're a
> different order of magnitude than they actually are.
>

Identical Parsing Canonical Form only tells you whether you can parse the
data, not whether you can resolve it.  Indeed, if you use a different
logical type definition but only check parsing-level compatibility, then you
can get incorrect data.
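
To make the risk concrete, here is a minimal Python sketch.  It is not the
Avro library's API: the hypothetical parsing_canonical_form below implements
only the attribute-stripping and primitive-simplification steps of the spec's
canonicalization, and decode_decimal assumes the value is stored as a
big-endian two's-complement unscaled integer, as the decimal logical type
describes.

```python
import json
from decimal import Decimal

# Attributes that survive canonicalization; everything else
# (including logicalType, precision and scale) is dropped.
KEPT = ("name", "type", "fields", "symbols", "items", "values", "size")

def parsing_canonical_form(schema):
    """Simplified sketch: strip attributes and collapse bare primitives."""
    def strip(node):
        if isinstance(node, dict):
            out = {k: strip(node[k]) for k in KEPT if k in node}
            # {"type": "bytes"} collapses to "bytes"
            if set(out) == {"type"} and isinstance(out["type"], str):
                return out["type"]
            return out
        if isinstance(node, list):
            return [strip(n) for n in node]
        return node
    return json.dumps(strip(schema), separators=(",", ":"))

def decode_decimal(raw, scale):
    """Interpret raw bytes as an unscaled integer and apply the scale."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

writer = {"type": "bytes", "logicalType": "decimal", "precision": 9, "scale": 2}
reader = {"type": "bytes", "logicalType": "decimal", "precision": 9, "scale": 4}

# The two schemas are textually equal once canonicalized...
assert parsing_canonical_form(writer) == parsing_canonical_form(reader)

# ...yet the same bytes decode to different magnitudes.
raw = (12345).to_bytes(2, byteorder="big", signed=True)
print(decode_decimal(raw, writer["scale"]))  # 123.45
print(decode_decimal(raw, reader["scale"]))  # 1.2345
```

A reader that compares only canonical forms (or fingerprints of them) has no
way to notice the scale mismatch here.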

There is a proposal to add an alternate canonical form that incorporates
logical types:

https://github.com/apache/avro/pull/805
https://issues.apache.org/jira/browse/AVRO-2299

Does this look like what you'd like?  It seems that patch has been ignored,
but perhaps we can pick it up again and get it committed.
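
For illustration only, one way a form like that could behave is to keep the
logical-type attributes while stripping, so that schemas differing only in
scale get different fingerprints.  The sketch below is hypothetical and is not
taken from the patch: the list of retained attributes, the function names, and
the choice of SHA-256 are all assumptions.

```python
import hashlib
import json

# Hypothetical: structural attributes plus the decimal logical-type ones.
KEPT = ("name", "type", "fields", "symbols", "items", "values", "size",
        "logicalType", "precision", "scale")

def logical_canonical_form(schema):
    """Strip everything except KEPT, emitting keys in a fixed order."""
    def strip(node):
        if isinstance(node, dict):
            return {k: strip(node[k]) for k in KEPT if k in node}
        if isinstance(node, list):
            return [strip(n) for n in node]
        return node
    return json.dumps(strip(schema), separators=(",", ":"))

def fingerprint(schema):
    """SHA-256 over the canonical text (an assumed choice of hash)."""
    text = logical_canonical_form(schema)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

scale2 = {"type": "bytes", "logicalType": "decimal", "precision": 9,
          "scale": 2, "doc": "price"}
scale4 = {"type": "bytes", "logicalType": "decimal", "precision": 9, "scale": 4}
scale2_nodoc = {"type": "bytes", "logicalType": "decimal", "precision": 9,
                "scale": 2}

# Different scale: different fingerprint.  Different doc only: same fingerprint.
assert fingerprint(scale2) != fingerprint(scale4)
assert fingerprint(scale2) == fingerprint(scale2_nodoc)
```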

Thanks,

Doug
