Hello folks,

My team and I are interested in contributing a
BigQueryDataFramesOperator / BigFramesOperator that would take some Python
code, similar to the PythonOperator or PythonVirtualenvOperator. The idea
is that the operator could automatically apply some best practices before
running the supplied code, as demonstrated in my blog post:
https://medium.com/google-cloud/creating-a-production-ready-data-pipeline-with-apache-airflow-and-bigframes-bead7d7d164b
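
To make that concrete, here is a rough sketch of how the operator might
be used in a DAG. The operator name, import path, and parameters are all
placeholders, since none of this exists yet; the bigframes calls are just
the usual pandas-style API:

    import datetime

    from airflow import DAG
    # Hypothetical import path; the proposed operator does not exist yet.
    from airflow.providers.google.cloud.operators.bigframes import (
        BigFramesOperator,
    )

    def transform():
        # Runs with bigframes available in an isolated environment.
        import bigframes.pandas as bpd

        df = bpd.read_gbq("bigquery-public-data.usa_names.usa_1910_current")
        totals = df.groupby("name")["number"].sum()
        totals.to_frame().to_gbq("my_dataset.name_totals", if_exists="replace")

    with DAG(
        dag_id="bigframes_example",
        start_date=datetime.datetime(2025, 1, 1),
        schedule=None,
    ) as dag:
        transform_task = BigFramesOperator(
            task_id="transform",
            python_callable=transform,
            gcp_conn_id="google_cloud_default",
        )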
I have a few questions before I begin implementation:

   1. BigFrames is a large-ish package with a pretty big dependency tree,
   and I'm wary of having an operator depend directly on it. Have other
   folks found good ways to avoid this kind of direct dependency?
   2. One idea for isolating dependencies is the virtualenv operator. Would
   it be acceptable to have an operator that wraps a PythonVirtualenvOperator,
   or one that subclasses it? If so, which would be preferred? (There is a
   rough subclassing sketch after this list.)
   3. Another feature I'd like to make sure we handle is fetching credentials
   via the BigQueryHook. There are a lot of complex auth scenarios, such as
   impersonation_scopes (
   https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/hooks/bigquery/index.html#airflow.providers.google.cloud.hooks.bigquery.BigQueryHook.impersonation_scopes
   ), that we want to cover. (A credentials sketch also follows this list.)
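
For question 2, a subclass might look something like this minimal sketch,
assuming the import path from the standard provider and that appending
bigframes to the venv requirements is the right hook point:

    from airflow.providers.standard.operators.python import (
        PythonVirtualenvOperator,
    )

    class BigFramesOperator(PythonVirtualenvOperator):
        """Sketch: run python_callable in a venv with bigframes installed."""

        def __init__(self, *, requirements=None, **kwargs):
            # Handling only the iterable-of-strings form of requirements
            # here, for brevity.
            requirements = list(requirements or [])
            # Keep bigframes (and its dependency tree) inside the isolated
            # environment so the provider never depends on it directly.
            if not any(req.startswith("bigframes") for req in requirements):
                requirements.append("bigframes")
            super().__init__(requirements=requirements, **kwargs)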
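
For question 3, the shape I have in mind is roughly the following. I
believe get_credentials_and_project_id() comes from GoogleBaseHook, and
the bigframes option names are from its bigquery options; both are worth
double-checking:

    from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook

    def configure_bigframes_session(gcp_conn_id, impersonation_chain=None):
        # Let the hook resolve the tricky auth scenarios (ADC, keyfiles,
        # service account impersonation, custom scopes, ...).
        hook = BigQueryHook(
            gcp_conn_id=gcp_conn_id,
            impersonation_chain=impersonation_chain,
        )
        credentials, project_id = hook.get_credentials_and_project_id()

        import bigframes.pandas as bpd

        bpd.options.bigquery.credentials = credentials
        bpd.options.bigquery.project = project_id

The catch is that a Credentials object can't easily cross the boundary
into a virtualenv subprocess, so I suspect this setup has to happen in the
generated script itself, which leads to the next idea.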

One idea we are bouncing around is using get_python_source() (
https://github.com/apache/airflow/blob/1c5aa24ccd63b5a5052eaebc52959c7a20fc298a/providers/standard/src/airflow/providers/standard/operators/python.py#L493)
and injecting the custom initialization code after the function definition.
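
Roughly like this, where the injected settings are just examples of the
best practices from the blog post, and the concatenation stands in for
however the virtualenv script template actually assembles the script:

    import textwrap

    def build_script_source(operator):
        # get_python_source() returns the dedented source of the
        # operator's python_callable.
        callable_source = operator.get_python_source()

        # Injected after the function definition, before the call that
        # the virtualenv script template adds.
        init_code = textwrap.dedent(
            """
            import bigframes.pandas as bpd

            # One of the settings discussed in the blog post.
            bpd.options.bigquery.ordering_mode = "partial"
            """
        )
        return callable_source + "\n" + init_code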

I'd love to hear your thoughts.

Tim Sweña (Swast)
Team Lead, BigQuery DataFrames
Google Cloud Platform
Chicago, IL, USA
