Hi all,

I have updated and polished a pull request I submitted some time ago, and
I would like to bring it to the attention of this list to get some
feedback or a review of the code.

The PR is at https://github.com/apache/beam/pull/9852

It adds a new option withQueryTempDataset to BigQueryIO.Read.

Currently, if I want to read from a table with BigQueryIO, I need to assign
the role bigquery.jobUser to the service account used by Apache Beam (e.g.
on Dataflow).

However, if I try to read from a view using the same role, the pipeline
will fail, because it needs to create a temporary dataset and table. The
name of this dataset is chosen by Apache Beam.

In practice, this requires giving the service account permission to
create datasets (e.g. assigning the role bigquery.user instead of
bigquery.jobUser), which is a very broad permission.

With the submitted PR, you can specify the temporary dataset used to read
from queries (e.g. when reading from a view). You can then keep the role
bigquery.jobUser on the Beam service account and only grant additional
permissions on that dataset to create temporary tables, confining any
potential write activity to that dataset.
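
A minimal sketch of how the option might be used, assuming it takes the ID
of an existing dataset as a string. The project, view and dataset names
below are hypothetical, and the exact signature is whatever the PR ends up
defining:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

public class ReadViewSketch {
  // Builds a read from a view that stages the query results in an existing
  // dataset, instead of letting Beam create a temporary dataset of its own.
  static BigQueryIO.TypedRead<TableRow> readFromView() {
    return BigQueryIO.readTableRows()
        // Hypothetical view name; reading a view goes through a query.
        .fromQuery("SELECT * FROM `data-project.reporting.my_view`")
        .usingStandardSql()
        // Hypothetical dataset name; assumed to exist already and to be
        // writable by the Beam service account.
        .withQueryTempDataset("beam_temp");
  }
}
```

The returned transform would then be applied as usual, e.g.
pipeline.apply(ReadViewSketch.readFromView()).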

The destination dataset can even be in a different project than the data
you are reading (something that is not possible with the currently
available options), so you don't need to give write permissions in the same
project where the data resides. In situations where there is an
"untouchable" data project with authorized views, it is currently
impossible to read from those authorized views with BigQueryIO, unless you
give write permissions to Beam in the "untouchable" project. With this PR,
you could confine those writes to another project and dataset.

I hope the need for this option makes sense. Any thoughts?

Kind regards,
Israel
