Hi all,

I have updated and polished a pull request that I submitted some time ago, and I would like to bring it to the attention of this list to get some feedback or a review of the code.
The PR is at https://github.com/apache/beam/pull/9852. It adds a new option, withQueryTempDataset, to BigQueryIO.Read.

Currently, to read from a table with BigQueryIO, I need to assign the role bigquery.jobUser to the service account used by Apache Beam (e.g. Dataflow). However, if I try to read from a view using the same role, the pipeline fails, because Beam needs to create a temporary dataset and table, and the name of this dataset is chosen by Apache Beam. In practice, this requires giving the service account permission to create datasets (e.g. by assigning the role bigquery.user instead of bigquery.jobUser), which is a very broad permission.

With the submitted PR, you can specify the temporary dataset used when reading from queries (e.g. when reading from a view). You can thus keep only the role bigquery.jobUser on the Beam service account and grant additional permissions on that dataset alone to create temporary tables, confining any potential write activity to that dataset.

The destination dataset can even be in a different project from the data you are reading (something that is not possible with the currently available options), so you don't need to grant write permissions in the project where the data resides. In situations where there is an "untouchable" data project with authorized views, it is currently impossible to read from those authorized views with BigQueryIO unless you give Beam write permissions in the "untouchable" project. With this PR, you could confine those writes to another project and dataset.

I hope the need for this option makes sense. Any thoughts?

Kind regards,
Israel