Hey Tao,

It does look like BEAM-11460 could work for you. Note that it relies on a dynamic object, which won't work with schema-aware transforms and SqlTransform. This is likely not a problem for you; I just wanted to point it out.
Out of curiosity, for your use case would it be acceptable if Beam peeked at the files at pipeline construction time to determine the schema for you? This is what we're doing for the new IOs in the Python SDK's DataFrame API. They're based on the pandas read_* methods, and use those methods at construction time to determine the schema.

Brian

On Wed, Jan 6, 2021 at 10:13 AM Alexey Romanenko <[email protected]> wrote:

> Hi Tao,
>
> This jira [1] looks exactly like what you are asking for, but it was
> merged recently (thanks to Anant Damle for working on this!) and it
> should be available only in Beam 2.28.0.
>
> [1] https://issues.apache.org/jira/browse/BEAM-11460
>
> Regards,
> Alexey
>
> On 6 Jan 2021, at 18:57, Tao Li <[email protected]> wrote:
>
> Hi beam community,
>
> Quick question about ParquetIO
> <https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.html>.
> Is there a way to avoid specifying the Avro schema when reading Parquet
> files? The reason is that we may not know the Parquet schema until we
> read the files. In comparison, the Spark Parquet reader
> <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html>
> does not require such a schema specification.
>
> Please advise. Thanks a lot!
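The construction-time "peek at the files" idea mentioned above can be sketched in plain Python. This is a minimal illustration of the general technique, not the actual Beam DataFrame IO code; the `peek_schema` function, its type-guessing rules, and the sample CSV data are all hypothetical:

```python
import csv
import io

def peek_schema(f, sample_rows=10):
    """Infer (column, type) pairs by peeking at a CSV file's header
    and a few sample rows, the way a construction-time schema probe might."""
    reader = csv.reader(f)
    header = next(reader)
    samples = [row for _, row in zip(range(sample_rows), reader)]

    def guess(values):
        # Try progressively looser types; fall back to str.
        for typ in (int, float):
            try:
                for v in values:
                    typ(v)
                return typ
            except ValueError:
                continue
        return str

    return [(name, guess([row[i] for row in samples]))
            for i, name in enumerate(header)]

# The schema is determined here, before any pipeline would run.
data = io.StringIO("id,price,label\n1,2.5,a\n2,3.0,b\n")
print(peek_schema(data))
# → [('id', <class 'int'>), ('price', <class 'float'>), ('label', <class 'str'>)]
```

The point is only that the file is opened and sampled while the pipeline is being *constructed*, so the inferred schema is available to downstream schema-aware transforms before execution starts.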
