nikitagrover19 commented on issue #36988:
URL: https://github.com/apache/beam/issues/36988#issuecomment-4749041566

   Hey, looked into this more. Turns out the fix needs two files, not one.
   
   `yaml_io.py`'s `read_from_bigquery()` hardcodes `output_type='BEAM_ROW'` for 
every YAML read, and there's no `schema` param exposed at all. So even if we 
fix `ReadFromBigQuery` itself, YAML users still couldn't pass a schema through 
without touching this file too.
   
   Here's what I'm thinking:
   - In `bigquery.py`, add a `query_output_schema` param to `ReadFromBigQuery`. 
When `query`, `output_type='BEAM_ROW'`, and this schema are all set, skip the 
`get_table()` lookup and build the row type directly from what was passed in. 
`convert_to_usertype()` already accepts any `TableSchema`-shaped object, so 
that part needs no changes. If no schema is passed, the existing error is 
still raised.
   - In `yaml_io.py`, add a `schema` param to `read_from_bigquery()` (same 
pattern `read_from_pubsub` already uses), require it when `query` is set, and 
pass it through.
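
   Rough sketch of the YAML-side argument handling described above. This is 
not the actual `yaml_io.py` code: `build_read_kwargs` is a hypothetical helper 
name, `query_output_schema` is the proposed param (it does not exist yet), and 
the "exactly one of `table` or `query`" check is my assumption about how the 
wrapper should behave.

```python
from typing import Any, Dict, List, Optional


def build_read_kwargs(
    *,
    table: Optional[str] = None,
    query: Optional[str] = None,
    schema: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate YAML args, return kwargs for the read."""
    # Assumption: exactly one of table / query is given.
    if (table is None) == (query is None):
        raise ValueError('Exactly one of table or query must be set.')

    # YAML reads always produce Beam Rows.
    kwargs: Dict[str, Any] = {'output_type': 'BEAM_ROW'}

    if table is not None:
        # Table reads can still look the schema up from the table itself.
        kwargs['table'] = table
        return kwargs

    # Query reads have no table to inspect, so the schema must be supplied.
    if schema is None:
        raise ValueError('schema is required when query is set.')
    kwargs['query'] = query
    kwargs['query_output_schema'] = schema  # proposed param, not in Beam today
    return kwargs
```

   In this sketch a `schema` passed alongside `table` is simply ignored; 
whether that should be an error instead is open.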
   
   One thing I'm not 100% sure on - should the YAML schema be a BigQuery-style 
field list, or something more generic like JSON schema? Leaning toward 
BigQuery-style since it's a BigQuery transform, but open to being told 
otherwise.
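
   To make the BigQuery-style option concrete, here is a small sketch of how 
such a field list could map to Python types. The `name` / `type` / `mode` keys 
follow BigQuery's field schema; `BQ_TO_PYTHON` and `fields_to_types` are 
hypothetical names for illustration only, and the type table is deliberately 
incomplete (no nested `RECORD`, timestamps, etc.).

```python
from typing import Any, Dict, List, Optional, Tuple

# Partial, illustrative mapping from BigQuery type names to Python types.
BQ_TO_PYTHON = {
    'STRING': str,
    'BYTES': bytes,
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'FLOAT64': float,
    'BOOLEAN': bool,
    'BOOL': bool,
}


def fields_to_types(fields: List[Dict[str, Any]]) -> List[Tuple[str, Any]]:
    """Hypothetical helper: turn a BigQuery-style field list into (name, type)."""
    result = []
    for field in fields:
        py_type = BQ_TO_PYTHON[field['type'].upper()]
        # BigQuery treats a missing mode as NULLABLE.
        mode = field.get('mode', 'NULLABLE').upper()
        if mode == 'REPEATED':
            py_type = List[py_type]
        elif mode == 'NULLABLE':
            py_type = Optional[py_type]
        result.append((field['name'], py_type))
    return result


# What a user might write under `schema:` in YAML, shown as a Python literal.
example_fields = [
    {'name': 'id', 'type': 'INTEGER', 'mode': 'REQUIRED'},
    {'name': 'tags', 'type': 'STRING', 'mode': 'REPEATED'},
    {'name': 'note', 'type': 'STRING'},
]
example_types = fields_to_types(example_fields)
```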
   
   I'd like to take this one if it sounds reasonable - happy to put up a PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
