>
> ParquetIO, CassandraIO/HBaseIO/BigTableIO (all should be about the same),
> JdbcIO, IcebergIO (doesn't exist yet, but is basically generalized
> schema-aware files as I understand it).

I think that adding Jiras with a tag "starter" for implementing push-down
for all of the IO interfaces listed above would be a good start. The design
doc does have an example for project push-down; predicate push-down example
is in the works.
Hopefully, that will make it straight forward for new contributors.

On Thu, Nov 28, 2019 at 4:32 AM David Morávek <david.mora...@gmail.com>
wrote:

> Nice, this should bring a great performance improvement for SQL. Thanks
> for your work!
>
> On Thu, Nov 28, 2019 at 6:33 AM Kenneth Knowles <k...@apache.org> wrote:
>
>> Nice! Thanks for the very thorough summary. I think this will be a really
>> good thing for Beam. Most of the IO sources are very highly optimized for
>> querying and will do it more efficiently than the Beam runner when the
>> structure of the query matches. I'm really excited to see the performance
>> measurements.
>>
>> A have a thought: your update did not mention a few extensions that we
>> might consider: ParquetIO, CassandraIO/HBaseIO/BigTableIO (all should be
>> about the same), JdbcIO, IcebergIO (doesn't exist yet, but is basically
>> generalized schema-aware files as I understand it). Are these things you
>> are thinking about doing, or would these be Jiras that could potentially be
>> tagged "starter"? They seem complex but maybe your framework will make it
>> feasible for someone with slightly less experience to implement new
>> versions of what you have already finished?
>>
>> Kenn
>>
>> On Tue, Nov 26, 2019 at 12:19 PM Kirill Kozlov <kirillkoz...@google.com>
>> wrote:
>>
>>> Hello everyone!
>>>
>>> I have been working on a push-down feature and would like to give a
>>> brief update on what is done and is still under works.
>>>
>>> *Things that are done*:
>>> General API for SQL IOs to provide information about what
>>> filters/projects they support [1]:
>>> - *Filter* can be unsupported, supported with field reordering, and
>>> supported without field reordering.
>>> - *Predicate* is broken down into a conjunctive normal form (CNF) and
>>> passed to a validator class to check what parts are supported or
>>> unsupported by an IO.
>>>
>>> A Calcite rule [2] that checks for push-down support, constructs a new
>>> IO source Rel [3] with pushed-down projects and filters when applicable,
>>> and preserves unsupported filters/projects.
>>>
>>> BigQuery should perform push-down when running queries in DIRECT_READ
>>> method [4].
>>>
>>> MongoDB project push-down support is in a PR [5] and predicate support
>>> will be added soon.
>>>
>>>
>>> *Things that are in progress:*
>>> Documenting how developers can enable push-down for IOs that support it.
>>>
>>> Documenting certain limitation for BigQuery push-down (ex: comparing
>>> values of 2 columns is not supported at the moment, so it is being
>>> preserved in a Calc).
>>>
>>> Updating google-cloud-bigquerystorage to 0.117.0-beta. Earlier versions
>>> have a gRPC message limit set to ~11MB, which may cause some pipelies to
>>> break when reading from a table with rows larger than the limit.
>>>
>>> Adding some sort of performance tests to run continuously to
>>> measure speed-up and detect regressions.
>>>
>>> Deciding how cost should be computed for the IO source Rel with
>>> push-down [6]. Right now the following formula is used: cost of an IO
>>> without push-down minus the normalized (between 0.0 and 1.0) benefit of a
>>> performed push-down.
>>> The challenge here is to make the change to the cost small enough to not
>>> break join reordering, but large enough to make the optimizer favor
>>> pushed-down IO.
>>>
>>>
>>> If you have any suggestions/questions/concerns I would love to hear them.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/BeamSqlTable.java#L36
>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamIOPushDownRule.java
>>> [3]
>>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamPushDownIOSourceRel.java
>>> [4]
>>> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L128
>>> [5] https://github.com/apache/beam/pull/10095
>>> [6] https://github.com/apache/beam/pull/10060
>>>
>>> --
>>> Kirill
>>>
>>

Reply via email to