Nice! Thanks for the very thorough summary. I think this will be a really
good thing for Beam. Most of the IO sources are very highly optimized for
querying and will do it more efficiently than the Beam runner when the
structure of the query matches. I'm really excited to see the performance
measurements.

A have a thought: your update did not mention a few extensions that we
might consider: ParquetIO, CassandraIO/HBaseIO/BigTableIO (all should be
about the same), JdbcIO, IcebergIO (doesn't exist yet, but is basically
generalized schema-aware files as I understand it). Are these things you
are thinking about doing, or would these be Jiras that could potentially be
tagged "starter"? They seem complex but maybe your framework will make it
feasible for someone with slightly less experience to implement new
versions of what you have already finished?

Kenn

On Tue, Nov 26, 2019 at 12:19 PM Kirill Kozlov <kirillkoz...@google.com>
wrote:

> Hello everyone!
>
> I have been working on a push-down feature and would like to give a brief
> update on what is done and is still under works.
>
> *Things that are done*:
> General API for SQL IOs to provide information about what filters/projects
> they support [1]:
> - *Filter* can be unsupported, supported with field reordering, and
> supported without field reordering.
> - *Predicate* is broken down into a conjunctive normal form (CNF) and
> passed to a validator class to check what parts are supported or
> unsupported by an IO.
>
> A Calcite rule [2] that checks for push-down support, constructs a new IO
> source Rel [3] with pushed-down projects and filters when applicable, and
> preserves unsupported filters/projects.
>
> BigQuery should perform push-down when running queries in DIRECT_READ
> method [4].
>
> MongoDB project push-down support is in a PR [5] and predicate support
> will be added soon.
>
>
> *Things that are in progress:*
> Documenting how developers can enable push-down for IOs that support it.
>
> Documenting certain limitation for BigQuery push-down (ex: comparing
> values of 2 columns is not supported at the moment, so it is being
> preserved in a Calc).
>
> Updating google-cloud-bigquerystorage to 0.117.0-beta. Earlier versions
> have a gRPC message limit set to ~11MB, which may cause some pipelies to
> break when reading from a table with rows larger than the limit.
>
> Adding some sort of performance tests to run continuously to
> measure speed-up and detect regressions.
>
> Deciding how cost should be computed for the IO source Rel with push-down
> [6]. Right now the following formula is used: cost of an IO without
> push-down minus the normalized (between 0.0 and 1.0) benefit of a performed
> push-down.
> The challenge here is to make the change to the cost small enough to not
> break join reordering, but large enough to make the optimizer favor
> pushed-down IO.
>
>
> If you have any suggestions/questions/concerns I would love to hear them.
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/BeamSqlTable.java#L36
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamIOPushDownRule.java
> [3]
> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamPushDownIOSourceRel.java
> [4]
> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L128
> [5] https://github.com/apache/beam/pull/10095
> [6] https://github.com/apache/beam/pull/10060
>
> --
> Kirill
>

Reply via email to