Nice! Thanks for the very thorough summary. I think this will be a really good thing for Beam. Most of the IO sources are very highly optimized for querying and will do it more efficiently than the Beam runner when the structure of the query matches. I'm really excited to see the performance measurements.
A have a thought: your update did not mention a few extensions that we might consider: ParquetIO, CassandraIO/HBaseIO/BigTableIO (all should be about the same), JdbcIO, IcebergIO (doesn't exist yet, but is basically generalized schema-aware files as I understand it). Are these things you are thinking about doing, or would these be Jiras that could potentially be tagged "starter"? They seem complex but maybe your framework will make it feasible for someone with slightly less experience to implement new versions of what you have already finished? Kenn On Tue, Nov 26, 2019 at 12:19 PM Kirill Kozlov <kirillkoz...@google.com> wrote: > Hello everyone! > > I have been working on a push-down feature and would like to give a brief > update on what is done and is still under works. > > *Things that are done*: > General API for SQL IOs to provide information about what filters/projects > they support [1]: > - *Filter* can be unsupported, supported with field reordering, and > supported without field reordering. > - *Predicate* is broken down into a conjunctive normal form (CNF) and > passed to a validator class to check what parts are supported or > unsupported by an IO. > > A Calcite rule [2] that checks for push-down support, constructs a new IO > source Rel [3] with pushed-down projects and filters when applicable, and > preserves unsupported filters/projects. > > BigQuery should perform push-down when running queries in DIRECT_READ > method [4]. > > MongoDB project push-down support is in a PR [5] and predicate support > will be added soon. > > > *Things that are in progress:* > Documenting how developers can enable push-down for IOs that support it. > > Documenting certain limitation for BigQuery push-down (ex: comparing > values of 2 columns is not supported at the moment, so it is being > preserved in a Calc). > > Updating google-cloud-bigquerystorage to 0.117.0-beta. Earlier versions > have a gRPC message limit set to ~11MB, which may cause some pipelies to > break when reading from a table with rows larger than the limit. > > Adding some sort of performance tests to run continuously to > measure speed-up and detect regressions. > > Deciding how cost should be computed for the IO source Rel with push-down > [6]. Right now the following formula is used: cost of an IO without > push-down minus the normalized (between 0.0 and 1.0) benefit of a performed > push-down. > The challenge here is to make the change to the cost small enough to not > break join reordering, but large enough to make the optimizer favor > pushed-down IO. > > > If you have any suggestions/questions/concerns I would love to hear them. > > [1] > https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/BeamSqlTable.java#L36 > [2] > https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamIOPushDownRule.java > [3] > https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamPushDownIOSourceRel.java > [4] > https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L128 > [5] https://github.com/apache/beam/pull/10095 > [6] https://github.com/apache/beam/pull/10060 > > -- > Kirill >