Nice, this should bring a great performance improvement for SQL. Thanks for your work!
On Thu, Nov 28, 2019 at 6:33 AM Kenneth Knowles <[email protected]> wrote: > Nice! Thanks for the very thorough summary. I think this will be a really > good thing for Beam. Most of the IO sources are very highly optimized for > querying and will do it more efficiently than the Beam runner when the > structure of the query matches. I'm really excited to see the performance > measurements. > > A have a thought: your update did not mention a few extensions that we > might consider: ParquetIO, CassandraIO/HBaseIO/BigTableIO (all should be > about the same), JdbcIO, IcebergIO (doesn't exist yet, but is basically > generalized schema-aware files as I understand it). Are these things you > are thinking about doing, or would these be Jiras that could potentially be > tagged "starter"? They seem complex but maybe your framework will make it > feasible for someone with slightly less experience to implement new > versions of what you have already finished? > > Kenn > > On Tue, Nov 26, 2019 at 12:19 PM Kirill Kozlov <[email protected]> > wrote: > >> Hello everyone! >> >> I have been working on a push-down feature and would like to give a brief >> update on what is done and is still under works. >> >> *Things that are done*: >> General API for SQL IOs to provide information about what >> filters/projects they support [1]: >> - *Filter* can be unsupported, supported with field reordering, and >> supported without field reordering. >> - *Predicate* is broken down into a conjunctive normal form (CNF) and >> passed to a validator class to check what parts are supported or >> unsupported by an IO. >> >> A Calcite rule [2] that checks for push-down support, constructs a new IO >> source Rel [3] with pushed-down projects and filters when applicable, and >> preserves unsupported filters/projects. >> >> BigQuery should perform push-down when running queries in DIRECT_READ >> method [4]. >> >> MongoDB project push-down support is in a PR [5] and predicate support >> will be added soon. >> >> >> *Things that are in progress:* >> Documenting how developers can enable push-down for IOs that support it. >> >> Documenting certain limitation for BigQuery push-down (ex: comparing >> values of 2 columns is not supported at the moment, so it is being >> preserved in a Calc). >> >> Updating google-cloud-bigquerystorage to 0.117.0-beta. Earlier versions >> have a gRPC message limit set to ~11MB, which may cause some pipelies to >> break when reading from a table with rows larger than the limit. >> >> Adding some sort of performance tests to run continuously to >> measure speed-up and detect regressions. >> >> Deciding how cost should be computed for the IO source Rel with push-down >> [6]. Right now the following formula is used: cost of an IO without >> push-down minus the normalized (between 0.0 and 1.0) benefit of a performed >> push-down. >> The challenge here is to make the change to the cost small enough to not >> break join reordering, but large enough to make the optimizer favor >> pushed-down IO. >> >> >> If you have any suggestions/questions/concerns I would love to hear them. >> >> [1] >> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/BeamSqlTable.java#L36 >> [2] >> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamIOPushDownRule.java >> [3] >> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamPushDownIOSourceRel.java >> [4] >> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L128 >> [5] https://github.com/apache/beam/pull/10095 >> [6] https://github.com/apache/beam/pull/10060 >> >> -- >> Kirill >> >
