I am trying to read a set of Parquet files from GCS, and the read is failing
because the column order in the Parquet files does not match the order of the
fields defined by my SchemaBuilder.

For example, I am defining my schema as follows:
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public static Schema buildSchema() {
  // Fields are declared in the order A, B, C, D.
  SchemaBuilder.FieldAssembler<Schema> builder =
      SchemaBuilder.record("schema").fields();
  builder.optionalString("A");
  builder.optionalLong("B");
  builder.optionalDouble("C");
  builder.optionalDouble("D");

  return builder.endRecord();
}

The Parquet files I am attempting to read, however, declare the columns in a
different order:

D, A, B, C
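
For what it's worth, the column order can be confirmed from the file footers
with something like the following (a minimal sketch; the path is a
placeholder, and it assumes the GCS Hadoop connector is on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public static void printParquetSchema() throws Exception {
  Configuration conf = new Configuration();
  // Placeholder path; any one of the files in the set will do.
  Path file = new Path("gs://my-bucket/data/part-00000.parquet");
  try (ParquetFileReader reader =
      ParquetFileReader.open(HadoopInputFile.from(file, conf))) {
    MessageType fileSchema = reader.getFooter().getFileMetaData().getSchema();
    System.out.println(fileSchema); // prints columns in file order: D, A, B, C
  }
}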

When I try to read the Parquet files using:

pipeline.apply("Read parquet", ParquetIO.read(schema).from(path))
...
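
For context, as I understand it ParquetIO attaches an Avro coder built from
the schema passed to read(), so the step above behaves roughly like this
(a sketch of my understanding, not the actual ParquetIO internals):

PCollection<GenericRecord> records =
    pipeline.apply("Read parquet", ParquetIO.read(schema).from(path));
// The records are then encoded with a coder derived from the supplied
// schema, i.e. roughly:
records.setCoder(AvroCoder.of(GenericRecord.class, schema));
// which expects every record to follow the A, B, C, D field order.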

The pipeline fails with:

java.lang.IllegalArgumentException: Unable to encode element '{"D": 1.0,
"A": "stringA", "B": 100, "C": 2.0}' with coder
'org.apache.beam.sdk.coders.AvroGenericCoder@f7996a3b'.

Caused by: org.apache.avro.UnresolvedUnionException: Not in union
["null","string"]: 1.0

So it looks like the value 1.0 from column D is being matched against the
first field in my schema, A, whose union type is ["null","string"]. My
question: is there a way to have ParquetIO ignore the order of the columns,
or does the field order in the read schema have to match the files exactly?
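
In case it helps frame the question: the workaround I am currently
considering is to build a second Avro schema that matches the files' column
order, read with that, and re-map each record onto the schema above. A
rough, untested sketch (buildFileOrderSchema is a hypothetical helper that
declares the same fields in the order D, A, B, C):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// buildFileOrderSchema() is hypothetical: same fields, declared as D, A, B, C.
Schema fileOrderSchema = buildFileOrderSchema();
Schema targetSchema = buildSchema();
// Avro's Schema is not serializable here, so hand the DoFn its JSON form.
final String targetSchemaJson = targetSchema.toString();

PCollection<GenericRecord> reordered =
    pipeline
        .apply("Read parquet", ParquetIO.read(fileOrderSchema).from(path))
        .apply("Reorder fields", ParDo.of(new DoFn<GenericRecord, GenericRecord>() {
          private transient Schema target;

          @ProcessElement
          public void processElement(ProcessContext c) {
            if (target == null) {
              target = new Schema.Parser().parse(targetSchemaJson);
            }
            // Copy field values by name into a record that uses the
            // desired A, B, C, D field order.
            GenericRecordBuilder out = new GenericRecordBuilder(target);
            for (Schema.Field f : target.getFields()) {
              out.set(f.name(), c.element().get(f.name()));
            }
            c.output(out.build());
          }
        }))
        .setCoder(AvroCoder.of(GenericRecord.class, targetSchema));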

For reference, I am executing this pipeline on Dataflow v2.23.0.

Thanks for your help,

Joe
