mik-laj commented on a change in pull request #6590: [AIRFLOW-5520] Add options to run Dataflow in a virtual environment
URL: https://github.com/apache/airflow/pull/6590#discussion_r347112557

##########
File path: airflow/gcp/hooks/dataflow.py
##########
@@ -515,8 +530,20 @@ def label_formatter(labels_dict):
             return ['--labels={}={}'.format(key, value) for key, value in labels_dict.items()]
 
-        self._start_dataflow(variables, name, [py_interpreter] + py_options + [dataflow],
-                             label_formatter, project_id)
+        if py_requirements is not None:
+            with TemporaryDirectory(prefix='dataflow-venv') as tmp_dir:
+                py_interpreter = prepare_virtualenv(

Review comment:

1. If you want to set up a private pip repository for Cloud Composer, you should follow these instructions: https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies#install-private
   You will also need to set the ``PIP_CONFIG_FILE`` environment variable when you use the virtual environment.

2. Unfortunately, in distributed systems we should not save files in the current environment and rely on them still being available after a restart. This can cause many problems, e.g. taking up all the disk space. Instead, I recommend delivering packages from a local folder that you have previously set up.

I have prepared a script that shows how to download all packages to a directory and then install them in a virtual environment:

```bash
#!/bin/bash -x

# Scratch directories: one for the downloaded packages, one for the target venv.
TMP_PYPI="$(mktemp -d)"
TMP_VENV="$(mktemp -d)"

virtualenv "${TMP_PYPI}/venv"
virtualenv "${TMP_VENV}"

# Step 1: download apache-beam[gcp] and its dependencies into a local folder.
START_FETCH=$(date +%s)
source "${TMP_VENV}/bin/activate"
mkdir -p "${TMP_PYPI}/packages"
pip download 'apache-beam[gcp]' -d "${TMP_PYPI}/packages"
END_FETCH=$(date +%s)

# Step 2: install directly from the downloaded files.
START_INSTALL=$(date +%s)
source "${TMP_VENV}/bin/activate"
# Left unquoted on purpose so that every file becomes a separate argument.
pip install $(find "${TMP_PYPI}/packages" -type f)
END_INSTALL=$(date +%s)

echo "Fetch time: $(( END_FETCH - START_FETCH )) seconds"
echo "Install time: $(( END_INSTALL - START_INSTALL )) seconds"
```

For me, the download process took a long time, but the local installation was significantly faster.
```
Fetch time: 41 seconds
Install time: 20 seconds
```

It is worth remembering that just starting the workers takes a few minutes, so in this case a difference of a few seconds will not matter much.
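
For illustration, the same two-step idea (fill a local package folder once, then install from it) could be driven from Python roughly as follows. This is only a sketch under stated assumptions, not the hook's actual code: the helper names `build_download_cmd`, `build_offline_install_cmd` and `prepare_offline_venv` are hypothetical, it uses the standard-library `venv` module rather than `virtualenv`, and it assumes a POSIX layout where the interpreter lives at `bin/python`. Unlike the script above, which passes the downloaded files to `pip install` directly, it uses pip's `--no-index` and `--find-links` options so the install step never contacts a remote index.

```python
import os
import subprocess
import sys
from typing import List, Sequence


def build_download_cmd(requirements: Sequence[str], packages_dir: str,
                       python: str = sys.executable) -> List[str]:
    """Build the command that downloads the requirements (and their
    dependencies) into packages_dir."""
    return [python, "-m", "pip", "download", "--dest", packages_dir, *requirements]


def build_offline_install_cmd(requirements: Sequence[str], packages_dir: str,
                              venv_python: str) -> List[str]:
    """Build the command that installs the requirements using only the
    files found in packages_dir (no remote package index)."""
    return [venv_python, "-m", "pip", "install",
            "--no-index", "--find-links", packages_dir, *requirements]


def prepare_offline_venv(requirements: Sequence[str], packages_dir: str,
                         venv_dir: str) -> str:
    """Create a virtual environment in venv_dir, install the requirements
    from packages_dir, and return the path to the venv's interpreter."""
    subprocess.check_call([sys.executable, "-m", "venv", venv_dir])
    # Assumption: POSIX layout; on Windows the interpreter is under Scripts\.
    venv_python = os.path.join(venv_dir, "bin", "python")
    subprocess.check_call(
        build_offline_install_cmd(requirements, packages_dir, venv_python)
    )
    return venv_python
```

In this sketch the download command would be run once, ahead of time, for example with `subprocess.check_call(build_download_cmd(['apache-beam[gcp]'], packages_dir))`, and each task run would then only call `prepare_offline_venv` against the already populated folder.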