Thanks for the quick responses! Mine are inline as well.

On Thu, Jan 13, 2022 at 9:01 PM Brian Hulette <bhule...@google.com> wrote:
> I added some responses inline. Also adding dev@ since this is getting
> into SQL internals.
>
> On Thu, Jan 13, 2022 at 10:29 AM Steve Niemitz <sniem...@apache.org> wrote:
>
>> I've been playing around with CREATE EXTERNAL TABLE (using a custom
>> TableProvider as well) w/ BeamSQL and really love it. I have a few
>> questions I've accumulated as I've been using it that I wanted to ask.
>>
>> - I'm a little confused about the need to define columns in the CREATE
>> EXTERNAL TABLE statement. If I have a BeamSqlTable implementation that
>> can provide the schema on its own, it seems like the columns supplied
>> to the CREATE statement are ignored. This is ideal anyway, since it's
>> infeasible for users to provide the entire schema up-front, especially
>> for more complicated sources. Should the column list be optional here
>> instead?
>
> Our documentation would seem to indicate that defining columns is
> optional - looking at the example for BigQuery here [1], the schema is
> not provided. Those docs must be aspirational though; I just checked,
> and BigQueryTableProvider definitely expects the schema to be defined
> and uses it [2].
>
> I think it would make sense to make the column list optional - that way
> we can actually fulfill our BigQuery documentation.

Big +1 to that.

> Note if you're building your own custom TableProvider, you might not
> need to use CREATE EXTERNAL TABLE. You could add an implementation of
> TableProvider.getTable that retrieves the metadata for a given table
> name and returns a Table instance that can build the necessary IOs.
> This is only possible if you can retrieve all the metadata you need to
> construct the IOs, though. If you want users to be able to configure it
> further (one example might be specifying the read mode for BigQuery),
> this won't work.
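To make the getTable pattern concrete, here's a minimal sketch of a provider that resolves a table's schema from its own metadata at lookup time, so the user never supplies a column list. These classes (SchemaResolvingProvider, ResolvedTable) are simplified stand-ins I've invented for illustration, not Beam's actual TableProvider/BeamSqlTable API; the real implementation would build a Beam Table and report the discovered schema from BeamSqlTable.getSchema.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the pattern described above: resolve a table's
// schema from the provider's own metadata store, keyed by table name.
class SchemaResolvingProvider {

  // Stand-in for a resolved table: its name plus the "real" schema the
  // provider discovered on its own.
  static class ResolvedTable {
    final String name;
    final List<String> columns;

    ResolvedTable(String name, List<String> columns) {
      this.name = name;
      this.columns = columns;
    }
  }

  // Pretend metadata store mapping table name -> column names.
  private final Map<String, List<String>> metadata;

  SchemaResolvingProvider(Map<String, List<String>> metadata) {
    this.metadata = metadata;
  }

  // Analogue of TableProvider.getTable(String): look up the metadata for
  // the given name and build a table instance, ignoring any user-supplied
  // column list entirely.
  Optional<ResolvedTable> getTable(String name) {
    return Optional.ofNullable(metadata.get(name))
        .map(cols -> new ResolvedTable(name, cols));
  }
}
```

As noted in the thread, this only works when all metadata needed to construct the IOs is retrievable from the name alone; anything user-configurable still needs CREATE EXTERNAL TABLE.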
> [1]
> https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/#bigquery
> [2]
> https://github.com/apache/beam/blob/872455570ae7f3e2e35360bccf93b503ae9fdb5c/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/bigquery/BigQueryTable.java#L82

Maybe I'm looking at the wrong thing? Both of those examples show a
column list, and the BNF (or whatever) syntax implies at least one "table
element" must be present.

But yeah, this is basically what I'm doing right now. I just return the
"real" schema in BeamSqlTable.getSchema and ignore whatever was passed
in. It seems to work correctly. Ideally the column list would be optional
here, as you alluded to above. It'll be clunky explaining to users
something like "just include any random column list, we'll ignore it".

>> - It seems like predicate pushdown only works if the schema is "flat"
>> (has no nested rows). I understand the complication in pushing down
>> more complicated nested predicates; however, assuming the table
>> implementation doesn't actually attempt to push them down, it seems
>> like it would be fine to allow?
>
> Do we have this limitation? I think predicate pushdown will work with
> predicates on nested fields. The table is presented with a list of
> RexNodes representing separable predicates; an individual predicate
> could add a filter on a nested column, IIUC.
>
> We may have the limitation that project pushdown won't work on nested
> rows though, since the API just takes a list of field names. It's
> possible we handle this by passing a joined name (e.g. foo.bar.baz), but
> I bet not. The design doc [3] does have a note saying "no nested tables
> for now".
>
> [3]
> https://docs.google.com/document/d/1-ysD7U7qF3MAmSfkbXZO_5PLJBevAL9bktlLCerd_jE/edit

BeamIOPushDownRule short-circuits on nested fields [1]. I can also verify
this just by the fact that my constructFilter method isn't called when my
schema contains a nested row.
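For anyone following along, here's a toy model of what "separable predicates" buys the table implementation: each predicate is considered independently, the table pushes down the ones it can handle, and the rest are retained for Beam to apply after the read. The real rule hands the table Calcite RexNodes; in this sketch a predicate is just the name of the field it references, with dots marking nested access - purely illustrative, not Beam's API.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the constructFilter contract: partition a list of
// separable predicates into those pushed down to the source and those
// retained for post-read evaluation.
class SeparablePredicates {

  static class Split {
    final List<String> pushed = new ArrayList<>();
    final List<String> retained = new ArrayList<>();
  }

  // Push down predicates on top-level fields only; predicates touching a
  // nested field (containing '.') are retained, mirroring the flat-schema
  // limitation discussed in the thread.
  static Split split(List<String> predicates) {
    Split s = new Split();
    for (String p : predicates) {
      if (p.contains(".")) {
        s.retained.add(p);
      } else {
        s.pushed.add(p);
      }
    }
    return s;
  }
}
```

Note this split happens per-predicate, which is why, in principle, a nested-field predicate could simply be retained rather than disabling pushdown for the whole query - the behavior BeamIOPushDownRule currently short-circuits away.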
[1]
https://github.com/apache/beam/blob/v2.35.0/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rule/BeamIOPushDownRule.java#L89

>> - As a follow-up on the above, I'd like to expose a "virtual" field in
>> my schema that represents the partition the data has come from. For
>> example, BigQuery has a similar concept called _PARTITIONTIME. This
>> would be picked up by the predicate pushdown and used to filter the
>> partitions being read. I can't really figure out how I'd construct
>> something similar here, even if pushdown worked in all cases. For
>> example, for this query:
>>
>> SELECT * FROM table
>> WHERE _PARTITIONTIME BETWEEN X AND Y
>>
>> I'd want that filter to be pushed down to my IO, but also the
>> _PARTITIONTIME column wouldn't be returned in the select list. I was
>> hoping to use BigQueryIO as an example of how to do this, but it
>> doesn't seem like it exposes the virtual _PARTITIONTIME column either.
>
> Yeah, I think this will be hard to do with our current abstractions. You
> may be able to do it if you're ok with actually populating a
> _PARTITIONTIME column though.

I can play around with something like that; it'd be a little weird to
have it in the result schema, but not the end of the world.
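A rough sketch of that workaround, for the record: if the partition time is materialized as a real column, a pushed-down "BETWEEN lo AND hi" predicate on it can prune whole partitions before any rows are read, and each emitted row carries the populated value. The class and method names here are invented for illustration and the partition store is just an in-memory map, not anything from Beam or BigQueryIO.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model: partitions keyed by partition time, each holding a
// list of row payloads. The pushed-down range predicate selects which
// partitions get read at all; the partition time is then populated into
// every emitted row as a real (not virtual) column.
class PartitionTimePruning {

  // Rows are just (partitionTime, payload) pairs for this sketch.
  static List<long[]> read(TreeMap<Long, List<Long>> partitions, long lo, long hi) {
    List<long[]> rows = new ArrayList<>();
    // subMap applies the pushed-down filter: only partitions whose time
    // falls in [lo, hi] are touched.
    for (Map.Entry<Long, List<Long>> part :
        partitions.subMap(lo, true, hi, true).entrySet()) {
      for (long payload : part.getValue()) {
        // Populate the partition-time column into the row itself.
        rows.add(new long[] {part.getKey(), payload});
      }
    }
    return rows;
  }
}
```

The downside is exactly the one raised above: the column shows up in the result schema even when the user only wanted it as a filter.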