Re: [Question] Beam Schema, fields order

Alexey Romanenko Wed, 06 Apr 2022 08:22:38 -0700

> On 6 Apr 2022, at 03:01, Reuven Lax <re...@google.com> wrote:
> 
> The reason is that the fields are encoded and decoded in order when encoding 
> a Row. The encoded version of Row does not include field names (for reasons 
> of performance - it would be much slower and more expensive if each record 
> had to include all the field names).
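
To make the point above concrete for myself, here is a minimal sketch of positional encoding in plain Java. It is not Beam's actual coder: the class and the encode/decode helpers are hypothetical, and a schema is reduced to an ordered list of field names. It only shows what goes wrong when the same encoded values are read back with a differently ordered schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not Beam's RowCoder.
public class PositionalEncodingSketch {

  // Writes the values in schema order. Field names are NOT written.
  static List<Object> encode(List<String> schema, Map<String, Object> row) {
    List<Object> encoded = new ArrayList<>();
    for (String field : schema) {
      encoded.add(row.get(field));
    }
    return encoded;
  }

  // Pairs the i-th encoded value with the i-th field of the reader's schema.
  static Map<String, Object> decode(List<String> schema, List<Object> encoded) {
    Map<String, Object> row = new LinkedHashMap<>();
    for (int i = 0; i < schema.size(); i++) {
      row.put(schema.get(i), encoded.get(i));
    }
    return row;
  }

  public static void main(String[] args) {
    List<String> writerSchema = List.of("id", "name");
    List<String> readerSchema = List.of("name", "id"); // same fields, other order
    Map<String, Object> row = Map.of("id", 42, "name", "beam");

    List<Object> encoded = encode(writerSchema, row);
    System.out.println(decode(writerSchema, encoded)); // {id=42, name=beam}
    System.out.println(decode(readerSchema, encoded)); // {name=42, id=beam}
  }
}
```

With the reordered schema the values end up under the wrong names, which is why the order has to be part of the schema identity as long as names are not encoded.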


Just as an idea: should we keep and internally use a “normalised” schema in 
such cases, to avoid these issues with field ordering? 
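
What I have in mind is roughly the following. This is only a sketch in plain Java under my own assumptions: "normalised" here means top-level field names sorted alphabetically, nested schemas are ignored, and the class and method names are made up for illustration (nothing like this exists in Beam as far as I know).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; a schema is just an ordered list of field names.
public class NormalisedSchemaSketch {

  // Returns the field names in a deterministic (alphabetical) order.
  static List<String> normalise(List<String> fields) {
    List<String> sorted = new ArrayList<>(fields);
    Collections.sort(sorted);
    return sorted;
  }

  // Moves positional values from the original field order into the normalised order.
  static List<Object> reorder(List<String> fields, List<Object> values) {
    List<Object> reordered = new ArrayList<>();
    for (String name : normalise(fields)) {
      reordered.add(values.get(fields.indexOf(name)));
    }
    return reordered;
  }

  public static void main(String[] args) {
    List<String> schemaA = List.of("name", "id");
    List<String> schemaB = List.of("id", "name");

    // Both orderings collapse to the same normalised schema and the same value order.
    System.out.println(normalise(schemaA).equals(normalise(schemaB))); // true
    System.out.println(reorder(schemaA, List.of("beam", 42)));         // [42, beam]
    System.out.println(reorder(schemaB, List.of(42, "beam")));         // [42, beam]
  }
}
```

The obvious cost is that the user-visible field order would no longer match the declared one, so this would probably have to stay an internal representation.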

> Check out SchemaTestUtils.equivalentTo. It should allow you to test that two 
> rows are equivalent (i.e. have the same fields, but possibly in a different 
> order).

Thanks, I already did that for another test - AvroSchemaTest.testPojoSchema() - 
where we compare the schema [1], not rows. 

However, I’m not sure this is the right workaround. If the goal of this test 
is to check that the Beam schema created from AvroPojo is the SAME as the 
default Beam Pojo schema, then the check is not correct because, as you said 
above, from the Beam perspective they are two different schemas due to the 
different field order. 

The same issue applies to the AvroSchemaTest.testPojoRecordToRow() test, where 
we compare rows; it fails because Row.equals() compares the schemas first:

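class Row {
  @Override
  public boolean equals(Object o) {
    …
    // "other" is the Row being compared against. Schema equality is
    // order-sensitive, so a different field order makes the rows unequal.
    if (!Objects.equals(getSchema(), other.getSchema())) {
      return false;
    }
    …
  }
}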
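
To show the difference between the two kinds of comparison, here is a small self-contained sketch. MiniRow is a hypothetical stand-in, not Beam's Row: its equals() checks the (ordered) schema first, like the snippet above, while its equivalentTo() method is my own simplification of an order-insensitive check that compares field name to value mappings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a row: ordered field names plus positional values.
public class MiniRow {
  final List<String> schema;
  final List<Object> values;

  MiniRow(List<String> schema, List<Object> values) {
    this.schema = schema;
    this.values = values;
  }

  // Order-sensitive: a different field order means a different schema.
  @Override
  public boolean equals(Object o) {
    if (!(o instanceof MiniRow)) {
      return false;
    }
    MiniRow other = (MiniRow) o;
    if (!Objects.equals(schema, other.schema)) {
      return false;
    }
    return Objects.equals(values, other.values);
  }

  @Override
  public int hashCode() {
    return Objects.hash(schema, values);
  }

  // Order-insensitive: compares the name -> value mappings instead.
  boolean equivalentTo(MiniRow other) {
    return toMap().equals(other.toMap());
  }

  private Map<String, Object> toMap() {
    Map<String, Object> byName = new HashMap<>();
    for (int i = 0; i < schema.size(); i++) {
      byName.put(schema.get(i), values.get(i));
    }
    return byName;
  }

  public static void main(String[] args) {
    MiniRow a = new MiniRow(List.of("id", "name"), List.of(42, "beam"));
    MiniRow b = new MiniRow(List.of("name", "id"), List.of("beam", 42));
    System.out.println(a.equals(b));       // false
    System.out.println(a.equivalentTo(b)); // true
  }
}
```

So an "equivalent" check makes the test pass, but it answers a weaker question than "are these the same row".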

[1] 
https://github.com/apache/beam/pull/17246/files#diff-ca874b6d378d007a590c7eb781635275623fd6d300ab1330f73c29951e7dc505R380

—
Alexey



>  
> 
> [1] https://github.com/apache/beam/pull/17246
> 
>> 
>> At the same time, when a schema is generated by different schema providers, 
>> the order of fields can be non-deterministic in some cases.
>> 
>> For example, “GetterBasedSchemaProvider.toRowFunction(TypeDescriptor)” says 
>> [3] that:
>> - “schemaFor is non deterministic - it might return fields in an arbitrary 
>> order. The reason why is that Java reflection does not guarantee the order 
>> in which it returns fields and methods, and these schemas are often based on 
>> reflective analysis of classes.”
>> 
>> So, iiuc, this means that we can potentially end up with the "same" schema 
>> but with a different field order for the same class (a POJO, for example) 
>> when the schema is generated on different JVMs. 
>> 
>> Correct, and see above.
>>  
>> 
>> And now the actual questions: 
>> - Should two Rows with the same field values, but whose schemas have a 
>> different field order, be considered two different rows?
>> - Is the behaviour described above what the initial schema design 
>> intended? 
>> - If field order is so important, then why?
>> 
>> PS: My question is actually related to “AvroRecordSchema().toRowFunction()”, 
>> but I guess other SchemaProviders can be affected as well.
>> 
>> 
>> —
>> Alexey
>> 
>> [1] https://beam.apache.org/documentation/programming-guide/#schema-definition
>> [2] https://github.com/apache/beam/blob/0262ee53c6018d929a8a40fdf66735cc7e934951/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L303
>> [3] https://github.com/apache/beam/blob/0262ee53c6018d929a8a40fdf66735cc7e934951/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/GetterBasedSchemaProvider.java#L91
