Hi dev@,

A quick update about a new Beam SQL feature.

In short, we have wired up support for plugging table providers into the
Beam SQL API, so that table schemas can be obtained from external sources.

*What does it even mean?*

Previously, in Java pipelines, you could apply a Beam SQL query only to
existing PCollections. We have a special SqlTransform for that: it converts
a SQL query into an equivalent PTransform that is applied to the
PCollection of Rows.

One major inconvenience of this approach is that anything you want to query
has to be a PCollection, i.e. you have to read the data from a specific
source and then convert it to Rows. This can cause several complications,
such as having to convert schemas from the source format to Beam manually,
or needing completely different logic when the source changes.

The new API allows you to plug in a schema provider that can resolve
tables and schemas automatically if they already exist somewhere else. This
way Beam SQL, with the help of the provider, does the table lookup, then the
IO configuration, and then the schema conversion if needed.
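
To make that flow a bit more concrete, here is a toy, stdlib-only Java
sketch of the idea. It is not Beam code: TableProviderSketch, SourceTable,
TableProvider, resolve and convertSchema are all hypothetical names made up
for this illustration, and the type mapping is only an example. It models
the lookup and the schema conversion; the IO configuration step is reduced
to records that are already held in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: every type here is hypothetical and is NOT a Beam class.
final class TableProviderSketch {

  /** A table as the external system describes it: source column types plus raw records. */
  static final class SourceTable {
    final Map<String, String> sourceColumns; // column name -> source type name
    final List<Object[]> records;            // stands in for data read through an IO

    SourceTable(Map<String, String> sourceColumns, List<Object[]> records) {
      this.sourceColumns = sourceColumns;
      this.records = records;
    }
  }

  /** Resolves table names against an external catalog, such as a metastore. */
  interface TableProvider {
    SourceTable lookup(String tableName); // returns null when the table is unknown
  }

  /** Lookup: pick the provider by the first name part, then ask it for the table. */
  static SourceTable resolve(Map<String, TableProvider> providers, String qualifiedName) {
    String[] parts = qualifiedName.split("\\.", 2);
    if (parts.length != 2) {
      throw new IllegalArgumentException("Expected provider.table, got: " + qualifiedName);
    }
    TableProvider provider = providers.get(parts[0]);
    SourceTable table = provider == null ? null : provider.lookup(parts[1]);
    if (table == null) {
      throw new IllegalArgumentException("Unknown table: " + qualifiedName);
    }
    return table;
  }

  /** Schema conversion: map source type names onto the query engine's type names. */
  static Map<String, String> convertSchema(Map<String, String> sourceColumns) {
    Map<String, String> converted = new LinkedHashMap<>();
    for (Map.Entry<String, String> column : sourceColumns.entrySet()) {
      // Illustrative mapping only: "int" becomes INT32, everything else STRING.
      String type = column.getValue().equals("int") ? "INT32" : "STRING";
      converted.put(column.getKey(), type);
    }
    return converted;
  }

  public static void main(String[] args) {
    Map<String, String> columns = new LinkedHashMap<>();
    columns.put("f_int", "int");
    columns.put("f_str", "string");
    List<Object[]> records = new ArrayList<>();
    records.add(new Object[] {1, "hive_1"});
    SourceTable myTable = new SourceTable(columns, records);

    // Register one provider under the name "hive"; it knows a single table.
    Map<String, TableProvider> providers = new HashMap<>();
    providers.put("hive", name -> name.equals("mytable") ? myTable : null);

    SourceTable resolved = resolve(providers, "hive.mytable");
    System.out.println(convertSchema(resolved.sourceColumns));
    // prints {f_int=INT32, f_str=STRING}
  }
}
```

The point of the sketch is only that the query author names a table, and
the registered provider supplies both the schema and the data behind it.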

As an example, here's a query
<https://github.com/apache/beam/blob/116600f32013620e748723b8022a7023fa8e2528/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlHiveSchemaTest.java#L175,L190>[1]
that joins two existing PCollections with a table from Hive using
HCatalogTableProvider. The Hive table lookup is automatic: the table
provider in this case resolves the tables by talking to the Hive Metastore
and reads the data by configuring and applying HCatalogIO, converting the
records to Rows under the hood.

*What's the status of this?*

This is a working implementation, but development is still ongoing. There
are bugs, the API might change, and there are a few more things I can see
coming out of this after further design discussions:

 * refactoring the underlying table/metadata provider code;
 * working out the design for creating / updating tables in
the metadata provider;
 * creating a DDL syntax for it;
 * creating more providers.

[1]
https://github.com/apache/beam/blob/116600f32013620e748723b8022a7023fa8e2528/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlHiveSchemaTest.java#L175,L190
