An annotation, @SchemaFieldNumber, was introduced in 2.37 to make sure we
get the same order of fields in a schema inferred from a POJO:
https://javadoc.io/doc/org.apache.beam/beam-sdks-java-core/latest/org/apache/beam/sdk/schemas/annotations/SchemaFieldNumber.html

With that annotation, schemaRegistry.getSchema(dataClass) should give you
a schema with the same field order.
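
To illustrate the general idea (this is not Beam's implementation), here is a
minimal plain-Java sketch. The @FieldNumber annotation, the Purchase class and
the orderedFieldNames() helper are all hypothetical stand-ins I made up for the
example. It only shows why an explicit number on each field gives a stable
order, whereas relying on the order returned by reflection does not.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class FieldOrderSketch {
  // Hypothetical stand-in for the idea behind @SchemaFieldNumber (not Beam's annotation).
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.FIELD)
  @interface FieldNumber {
    int value();
  }

  // Hypothetical POJO; the declaration order deliberately differs from the numbering.
  static class Purchase {
    @FieldNumber(2) public double amount;
    @FieldNumber(0) public String userId;
    @FieldNumber(1) public long itemId;
  }

  // Returns the annotated field names sorted by their explicit number,
  // so the result does not depend on the order reflection returns them in.
  static List<String> orderedFieldNames(Class<?> clazz) {
    List<Field> fields = new ArrayList<>();
    for (Field f : clazz.getDeclaredFields()) {
      if (f.isAnnotationPresent(FieldNumber.class)) {
        fields.add(f);
      }
    }
    fields.sort(Comparator.comparingInt((Field f) -> f.getAnnotation(FieldNumber.class).value()));
    List<String> names = new ArrayList<>();
    for (Field f : fields) {
      names.add(f.getName());
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(orderedFieldNames(Purchase.class)); // [userId, itemId, amount]
  }
}
```

In Beam itself you would put the real annotation from the link above on the
POJO fields rather than writing anything like this by hand.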


On Wed, Apr 6, 2022 at 1:35 AM Alexey Romanenko <aromanenko....@gmail.com>
wrote:

> Thanks for answers, Reuven. Please see the additional questions inline.
>
> On 5 Apr 2022, at 20:07, Reuven Lax <re...@google.com> wrote:
>
> On Tue, Apr 5, 2022 at 9:55 AM Alexey Romanenko <aromanenko....@gmail.com>
> wrote:
>
>>
>> So, a different field order matters.
>>
>> Additionally, since "Schema.equals()" is used in "Row.equals()", it
>> means that two Rows with differently ordered schemas but the same values
>> will be considered different rows. Is that correct?
>>
>
> Yes, but there are ways of dealing with this:
>
>
> But what is the point of this? Why can the field order be important, and
> under which circumstances?
>
> 1. If using Dataflow, the pipeline update feature allows you to update to
> a compatible schema (i.e. one in which the fields have the same names but a
> different order)
> 2. You can use the Convert transform to convert rows to a compatible schema
> with a different order.
>
>
> Well, for now it’s mostly related to unit tests (e.g.
> AvroSchemaTest.testPojoRecordToRow()) when we compare a manually created
> row with another row that is created from a POJO with AvroRecordSchema. I’m
> playing with an Avro version upgrade [1] and it fails because there are
> some changes in Avro and it creates an Avro schema with a different order
> of fields. So I’m actually wondering what we can do about that.
>
> [1] https://github.com/apache/beam/pull/17246
>
>
>> At the same time, when generating a schema with different schema
>> providers, the order of fields can be non-deterministic in some cases.
>>
>> For example, “GetterBasedSchemaProvider.toRowFunction(TypeDescriptor)”
>> says [3] that:
>> *- “schemaFor is non deterministic - it might return fields in an
>> arbitrary order. The reason why is that Java reflection does not guarantee
>> the order in which it returns fields and methods, and these schemas are
>> often based on reflective analysis of classes.”*
>>
>> So, IIUC, this means that we could potentially get the "same" schema but
>> with a different field order for the same class (a POJO, for example)
>> when it is generated on different JVMs.
>>
>
> Correct, and see above.
>
>
>>
>> So the actual questions are:
>> - Should two Rows with the same field values but with schemas that differ
>> only in field order be considered two different rows or not?
>> - Is the behaviour explained above what was intended by the initial
>> schema design?
>> - If field order is so important, then why?
>>
>> PS: My question is actually related to
>> "AvroRecordSchema().toRowFunction()" but I guess other SchemaProviders
>> can also be affected.
>>
>>
>> —
>> Alexey
>>
>> [1]
>> https://beam.apache.org/documentation/programming-guide/#schema-definition
>> [2]
>> https://github.com/apache/beam/blob/0262ee53c6018d929a8a40fdf66735cc7e934951/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L303
>> [3]
>> https://github.com/apache/beam/blob/0262ee53c6018d929a8a40fdf66735cc7e934951/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/GetterBasedSchemaProvider.java#L91
>>
>
>
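
On the Row.equals() question quoted above, a small plain-Java sketch may help
show why order matters when equality is positional, and what converting to a
compatible schema amounts to. SimpleRow and reorderedTo() are hypothetical
names for a simplified model; this is not Beam's Row class or its Convert
transform, just an illustration under that assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

class RowOrderSketch {
  // Hypothetical simplified row: field names and values stored positionally.
  static final class SimpleRow {
    final List<String> fieldNames;
    final List<Object> values;

    SimpleRow(List<String> fieldNames, List<Object> values) {
      this.fieldNames = fieldNames;
      this.values = values;
    }

    // Order-sensitive equality: both the schema order and the values must match.
    @Override
    public boolean equals(Object o) {
      if (!(o instanceof SimpleRow)) {
        return false;
      }
      SimpleRow other = (SimpleRow) o;
      return fieldNames.equals(other.fieldNames) && values.equals(other.values);
    }

    @Override
    public int hashCode() {
      return Objects.hash(fieldNames, values);
    }

    // Rebuilds the row with its values arranged in the target field order (matched by name).
    SimpleRow reorderedTo(List<String> targetOrder) {
      if (targetOrder.size() != fieldNames.size()) {
        throw new IllegalArgumentException("Field count differs");
      }
      List<Object> reordered = new ArrayList<>();
      for (String name : targetOrder) {
        int idx = fieldNames.indexOf(name);
        if (idx < 0) {
          throw new IllegalArgumentException("Missing field: " + name);
        }
        reordered.add(values.get(idx));
      }
      return new SimpleRow(targetOrder, reordered);
    }
  }

  public static void main(String[] args) {
    SimpleRow a = new SimpleRow(Arrays.asList("id", "name"), Arrays.<Object>asList(1, "x"));
    SimpleRow b = new SimpleRow(Arrays.asList("name", "id"), Arrays.<Object>asList("x", 1));
    System.out.println(a.equals(b));                           // false
    System.out.println(a.equals(b.reorderedTo(a.fieldNames))); // true
  }
}
```

For a unit test that compares a hand-built row with one derived from a class,
the analogous move would be to bring both rows to the same field order before
comparing them.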
