mik-laj commented on a change in pull request #6590: [AIRFLOW-5520] Add options 
to run Dataflow in a virtual environment
URL: https://github.com/apache/airflow/pull/6590#discussion_r347112557
 
 

 ##########
 File path: airflow/gcp/hooks/dataflow.py
 ##########
 @@ -515,8 +530,20 @@ def label_formatter(labels_dict):
             return ['--labels={}={}'.format(key, value)
                     for key, value in labels_dict.items()]
 
-        self._start_dataflow(variables, name, [py_interpreter] + py_options + [dataflow],
-                             label_formatter, project_id)
+        if py_requirements is not None:
+            with TemporaryDirectory(prefix='dataflow-venv') as tmp_dir:
+                py_interpreter = prepare_virtualenv(
 
 Review comment:
   1. To set up a private pip repository for Cloud Composer, follow these 
instructions:
   
https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies#install-private
   You will also need to set the ``PIP_CONFIG_FILE`` environment variable when 
you run pip in the virtual environment.
   2. Unfortunately, in a distributed system we should not save files in the 
current environment and rely on them still being available after a restart. 
Files left behind can also cause problems, e.g. filling up the disk.
   Instead, I recommend installing packages from a local folder that you have 
set up beforehand. I have prepared a script that downloads all packages into a 
directory and then installs them into a virtual environment:
   ```bash
   #!/bin/bash -x
   
   TMP_PYPI="$(mktemp -d)"
   TMP_VENV="$(mktemp -d)"
   
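   # One virtualenv for downloading, a separate clean one for installing
   virtualenv "${TMP_PYPI}/venv"
   virtualenv "${TMP_VENV}"
   
   START_FETCH=$(date +%s)
   source "${TMP_PYPI}/venv/bin/activate"
   mkdir -p "${TMP_PYPI}/packages"
   # Download apache-beam[gcp] and its dependencies into the local folder
   pip download 'apache-beam[gcp]' -d "${TMP_PYPI}/packages"
   END_FETCH=$(date +%s)
   
   START_INSTALL=$(date +%s)
   source "${TMP_VENV}/bin/activate"
   # Install every downloaded file from the local folder
   pip install $(find "${TMP_PYPI}/packages" -type f)
   END_INSTALL=$(date +%s)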
   
   echo "Fetch time: $(( END_FETCH - START_FETCH )) seconds"
   echo "Install time: $(( END_INSTALL - START_INSTALL )) seconds"
   ```
   In my run, downloading took about twice as long as installing from the 
local folder:
   ```
   Fetch time: 41 seconds
   Install time: 20 seconds
   ```
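
For reference, here is a minimal Python sketch of how the same idea might look on the hook side: create a virtual environment and install requirements only from a previously prepared local folder, optionally pointing pip at a private repository configuration. The function names (`build_offline_install_cmd`, `build_pip_env`, `install_from_local_dir`) and all paths are hypothetical and are not part of this PR or of Airflow. The sketch uses pip's `--no-index` and `--find-links` options rather than the `find` call from the script above, and it assumes the folder already contains every required package and dependency.

```python
import os
import subprocess
import venv
from typing import Dict, List, Optional, Sequence


def build_offline_install_cmd(
    venv_dir: str, packages_dir: str, requirements: Sequence[str]
) -> List[str]:
    """Build a pip command that installs only from a local folder."""
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    python = os.path.join(venv_dir, bin_dir, "python")
    return [
        python, "-m", "pip", "install",
        "--no-index",                  # never contact a package index
        "--find-links", packages_dir,  # resolve packages from this folder
        *requirements,
    ]


def build_pip_env(
    pip_config_file: Optional[str] = None,
    base_env: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Copy the environment and optionally set PIP_CONFIG_FILE for pip."""
    env = dict(os.environ if base_env is None else base_env)
    if pip_config_file is not None:
        env["PIP_CONFIG_FILE"] = pip_config_file
    return env


def install_from_local_dir(
    venv_dir: str,
    packages_dir: str,
    requirements: Sequence[str],
    pip_config_file: Optional[str] = None,
) -> None:
    """Create a virtual environment and install requirements offline."""
    venv.EnvBuilder(with_pip=True).create(venv_dir)
    subprocess.check_call(
        build_offline_install_cmd(venv_dir, packages_dir, requirements),
        env=build_pip_env(pip_config_file),
    )
```

With this shape, the virtual environment can live in a temporary directory that is removed after the job is submitted, while the package folder is prepared once outside the task.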

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
