> > ParquetIO, CassandraIO/HBaseIO/BigTableIO (all should be about the same), > JdbcIO, IcebergIO (doesn't exist yet, but is basically generalized > schema-aware files as I understand it).
I think that adding Jiras with a tag "starter" for implementing push-down for all of the IO interfaces listed above would be a good start. The design doc does have an example for project push-down; predicate push-down example is in the works. Hopefully, that will make it straight forward for new contributors. On Thu, Nov 28, 2019 at 4:32 AM David Morávek <david.mora...@gmail.com> wrote: > Nice, this should bring a great performance improvement for SQL. Thanks > for your work! > > On Thu, Nov 28, 2019 at 6:33 AM Kenneth Knowles <k...@apache.org> wrote: > >> Nice! Thanks for the very thorough summary. I think this will be a really >> good thing for Beam. Most of the IO sources are very highly optimized for >> querying and will do it more efficiently than the Beam runner when the >> structure of the query matches. I'm really excited to see the performance >> measurements. >> >> A have a thought: your update did not mention a few extensions that we >> might consider: ParquetIO, CassandraIO/HBaseIO/BigTableIO (all should be >> about the same), JdbcIO, IcebergIO (doesn't exist yet, but is basically >> generalized schema-aware files as I understand it). Are these things you >> are thinking about doing, or would these be Jiras that could potentially be >> tagged "starter"? They seem complex but maybe your framework will make it >> feasible for someone with slightly less experience to implement new >> versions of what you have already finished? >> >> Kenn >> >> On Tue, Nov 26, 2019 at 12:19 PM Kirill Kozlov <kirillkoz...@google.com> >> wrote: >> >>> Hello everyone! >>> >>> I have been working on a push-down feature and would like to give a >>> brief update on what is done and is still under works. >>> >>> *Things that are done*: >>> General API for SQL IOs to provide information about what >>> filters/projects they support [1]: >>> - *Filter* can be unsupported, supported with field reordering, and >>> supported without field reordering. >>> - *Predicate* is broken down into a conjunctive normal form (CNF) and >>> passed to a validator class to check what parts are supported or >>> unsupported by an IO. >>> >>> A Calcite rule [2] that checks for push-down support, constructs a new >>> IO source Rel [3] with pushed-down projects and filters when applicable, >>> and preserves unsupported filters/projects. >>> >>> BigQuery should perform push-down when running queries in DIRECT_READ >>> method [4]. >>> >>> MongoDB project push-down support is in a PR [5] and predicate support >>> will be added soon. >>> >>> >>> *Things that are in progress:* >>> Documenting how developers can enable push-down for IOs that support it. >>> >>> Documenting certain limitation for BigQuery push-down (ex: comparing >>> values of 2 columns is not supported at the moment, so it is being >>> preserved in a Calc). >>> >>> Updating google-cloud-bigquerystorage to 0.117.0-beta. Earlier versions >>> have a gRPC message limit set to ~11MB, which may cause some pipelies to >>> break when reading from a table with rows larger than the limit. >>> >>> Adding some sort of performance tests to run continuously to >>> measure speed-up and detect regressions. >>> >>> Deciding how cost should be computed for the IO source Rel with >>> push-down [6]. Right now the following formula is used: cost of an IO >>> without push-down minus the normalized (between 0.0 and 1.0) benefit of a >>> performed push-down. >>> The challenge here is to make the change to the cost small enough to not >>> break join reordering, but large enough to make the optimizer favor >>> pushed-down IO. >>> >>> >>> If you have any suggestions/questions/concerns I would love to hear them. >>> >>> [1] >>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/BeamSqlTable.java#L36 >>> [2] >>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamIOPushDownRule.java >>> [3] >>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamPushDownIOSourceRel.java >>> [4] >>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L128 >>> [5] https://github.com/apache/beam/pull/10095 >>> [6] https://github.com/apache/beam/pull/10060 >>> >>> -- >>> Kirill >>> >>