Hey Tao,

It does look like BEAM-11460 could work for you. Note that it relies on a dynamic object, which won't work with schema-aware transforms and SqlTransform. This is likely not a problem for you; I just wanted to point it out.
Out of curiosity, for your use case would it be acceptable if Beam peeked at the files at pipeline construction time to determine the schema for you? This is what we're doing for the new IOs in the Python SDK's DataFrame API. They're based on the pandas read_* methods, and use those methods at construction time to determine the schema.

Brian

On Wed, Jan 6, 2021 at 10:13 AM Alexey Romanenko <[email protected]> wrote:

> Hi Tao,
>
> This jira [1] looks exactly like what you are asking for, but it was
> merged recently (thanks to Anant Damle for working on this!) and it
> should be available only in Beam 2.28.0.
>
> [1] https://issues.apache.org/jira/browse/BEAM-11460
>
> Regards,
> Alexey
>
> On 6 Jan 2021, at 18:57, Tao Li <[email protected]> wrote:
>
> Hi beam community,
>
> Quick question about ParquetIO
> <https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html>.
> Is there a way to avoid specifying the Avro schema when reading Parquet
> files? The reason is that we may not know the Parquet schema until we
> read the files. In comparison, the Spark Parquet reader
> <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html>
> does not require such a schema specification.
>
> Please advise. Thanks a lot!
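The construction-time "peek at the files" idea mentioned above can be sketched in plain Python. This is a minimal illustration of the general technique, not the actual Beam DataFrame IO code; the `peek_schema` function, its type-guessing rules, and the sample CSV data are all hypothetical:

```python
import csv
import io

def peek_schema(f, sample_rows=10):
    """Infer (column, type) pairs by peeking at a CSV file's header
    and a few sample rows, the way a construction-time schema probe might."""
    reader = csv.reader(f)
    header = next(reader)
    samples = [row for _, row in zip(range(sample_rows), reader)]

    def guess(values):
        # Try progressively looser types; fall back to str.
        for typ in (int, float):
            try:
                for v in values:
                    typ(v)
                return typ
            except ValueError:
                continue
        return str

    return [(name, guess([row[i] for row in samples]))
            for i, name in enumerate(header)]

# The schema is determined here, before any pipeline would run.
data = io.StringIO("id,price,label\n1,2.5,a\n2,3.0,b\n")
print(peek_schema(data))
# → [('id', <class 'int'>), ('price', <class 'float'>), ('label', <class 'str'>)]
```

The point is only that the file is opened and sampled while the pipeline is being *constructed*, so the inferred schema is available to downstream schema-aware transforms before execution starts.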
