nikitagrover19 commented on issue #36988: URL: https://github.com/apache/beam/issues/36988#issuecomment-4749041566
Hey, I looked into this more. It turns out the fix needs two files, not one.

`yaml_io.py`'s `read_from_bigquery()` hardcodes `output_type='BEAM_ROW'` for every YAML read, and there's no `schema` param exposed at all. So even if we fix `ReadFromBigQuery` itself, YAML users still couldn't pass a schema through without touching this file too.

Here's what I'm thinking:

- In `bigquery.py`, add a `query_output_schema` param to `ReadFromBigQuery`. When someone sets `query` + `BEAM_ROW` + this schema, skip the `get_table()` lookup and build the row type directly from what they passed in. `convert_to_usertype()` already accepts any `TableSchema`-shaped object, so that part needs no changes. If no schema is passed, the existing error is still raised.
- In `yaml_io.py`, add a `schema` param to `read_from_bigquery()` (same pattern `read_from_pubsub` already uses), require it when `query` is set, and pass it through.

One thing I'm not 100% sure on: should the YAML schema be a BigQuery-style field list, or something more generic like JSON schema? Leaning toward BigQuery-style since it's a BigQuery transform, but open to being told otherwise.

I'd like to take this one if it sounds reasonable. Happy to put up a PR.
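As a rough illustration of the BigQuery-style option, here is a minimal, standalone sketch of how a user-supplied field list could be turned into a row type without any `get_table()` call. It does not use Beam at all: `row_type_from_fields` and the `_BQ_TO_PY` type map are hypothetical names invented for this example, and the real change would presumably go through `convert_to_usertype()` as described above.

```python
from typing import List, NamedTuple, Optional

# Hypothetical mapping from BigQuery type names to Python types.
# Only a handful of scalar types are covered in this sketch.
_BQ_TO_PY = {
    "STRING": str,
    "INTEGER": int,
    "INT64": int,
    "FLOAT": float,
    "FLOAT64": float,
    "BOOLEAN": bool,
    "BOOL": bool,
    "BYTES": bytes,
}


def row_type_from_fields(fields, name="BigQueryRow"):
    """Build a NamedTuple row type from a BigQuery-style field list.

    Each field is a dict with "name", "type", and an optional "mode"
    (treated as NULLABLE when absent). Hypothetical helper, not Beam code.
    """
    if not fields:
        # Mirrors the proposal: a query read without a schema is an error.
        raise ValueError("a non-empty schema is required when `query` is set")
    annotations = []
    for field in fields:
        bq_type = field["type"].upper()
        if bq_type not in _BQ_TO_PY:
            raise ValueError(f"unsupported BigQuery type: {field['type']}")
        py_type = _BQ_TO_PY[bq_type]
        mode = field.get("mode", "NULLABLE").upper()
        if mode == "REPEATED":
            py_type = List[py_type]
        elif mode != "REQUIRED":
            py_type = Optional[py_type]
        annotations.append((field["name"], py_type))
    return NamedTuple(name, annotations)


# Example: the kind of field list a YAML user might supply.
fields = [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "email", "type": "STRING"},
]
Row = row_type_from_fields(fields)
row = Row(id=1, email=None)
```

The point of the sketch is only that a BigQuery-style field list carries enough information (name, type, mode) to derive the row type up front, which is what makes skipping the table lookup possible.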
