[GitHub] [airflow] mik-laj commented on a change in pull request #6590: [AIRFLOW-5520] Add options to run Dataflow in a virtual environment

2019-11-18 Thread GitBox
mik-laj commented on a change in pull request #6590: [AIRFLOW-5520] Add options 
to run Dataflow in a virtual environment
URL: https://github.com/apache/airflow/pull/6590#discussion_r347680244
 
 

 ##
 File path: airflow/gcp/hooks/dataflow.py
 ##
 @@ -481,30 +483,43 @@ def start_python_dataflow(
         variables: Dict,
         dataflow: str,
         py_options: List[str],
+        py_interpreter: str = "python2",
+        py_requirements: Optional[List[str]] = None,
+        py_system_site_packages: bool = False,
         project_id: Optional[str] = None,
         append_job_name: bool = True,
-        py_interpreter: str = "python2"
     ):
         """
         Starts Dataflow job.
 
         :param job_name: The name of the job.
         :type job_name: str
         :param variables: Variables passed to the job.
-        :type variables: dict
+        :type variables: Dict
         :param dataflow: Name of the Dataflow process.
         :type dataflow: str
         :param py_options: Additional options.
-        :type py_options: list
-        :param append_job_name: True if unique suffix has to be appended to job name.
-        :type append_job_name: bool
-        :param project_id: Optional, the GCP project ID in which to start a job.
-            If set to None or missing, the default project_id from the GCP connection is used.
+        :type py_options: List[str]
         :param py_interpreter: Python version of the beam pipeline.
             If None, this defaults to the python2.
             To track python versions supported by beam and related
             issues check: https://issues.apache.org/jira/browse/BEAM-1251
+        :param py_requirements: Additional python package to install.
+            If a value is passed to this parameter, a new virtual environment has been created with
+            additional packages installed.
+
+            You could also install the apache-beam package if it is not installed on your system or you want
+            to use a different version.
+        :type py_requirements: List[str]
+        :param py_system_site_packages: Whether to include system_site_packages in your virtualenv.
+            See virtualenv documentation for more information.
+
+            This option is only relevant if the AAA parameter is passed.
 
 Review comment:
  Fixed, thanks. (AAA = py_requirements.)
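
For orientation, here is a minimal sketch of how the new parameters might be called once this change lands. The hook class name ``DataFlowHook``, the connection id, and every argument value below are illustrative assumptions; only the parameter names come from the diff above.

```python
# Sketch only: assumes the hook in airflow/gcp/hooks/dataflow.py is DataFlowHook
# and that job_name keeps its documented meaning; all values are made up.
from airflow.gcp.hooks.dataflow import DataFlowHook

hook = DataFlowHook(gcp_conn_id="google_cloud_default")
hook.start_python_dataflow(
    job_name="example-python-job",
    variables={"project": "my-project", "region": "europe-west1"},
    dataflow="/path/to/pipeline.py",          # the Beam pipeline file to execute
    py_options=[],
    py_interpreter="python3",                 # interpreter used for the new virtualenv
    py_requirements=["apache-beam[gcp]"],     # triggers creation of a temporary virtualenv
    py_system_site_packages=False,            # keep the virtualenv isolated from system packages
)
```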


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] mik-laj commented on a change in pull request #6590: [AIRFLOW-5520] Add options to run Dataflow in a virtual environment

2019-11-16 Thread GitBox
mik-laj commented on a change in pull request #6590: [AIRFLOW-5520] Add options 
to run Dataflow in a virtual environment
URL: https://github.com/apache/airflow/pull/6590#discussion_r347112557
 
 

 ##
 File path: airflow/gcp/hooks/dataflow.py
 ##
 @@ -515,8 +530,20 @@ def label_formatter(labels_dict):
             return ['--labels={}={}'.format(key, value)
                     for key, value in labels_dict.items()]
 
-        self._start_dataflow(variables, name, [py_interpreter] + py_options + [dataflow],
-                             label_formatter, project_id)
+        if py_requirements is not None:
+            with TemporaryDirectory(prefix='dataflow-venv') as tmp_dir:
+                py_interpreter = prepare_virtualenv(
 
 Review comment:
   1. If you want to set up a private pip repository for Cloud Composer, you
   should follow these instructions:
   https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies#install-private
   You will also need to set the ``PIP_CONFIG_FILE`` environment variable when
   you use the virtual environment.
   2. Unfortunately, in distributed systems we should not save files to the
   current environment and rely on them still being available after a restart.
   This can cause many problems, e.g. taking up all the disk space.
   Instead, I recommend delivering packages from a local folder that you have
   previously set up. I have prepared a script that shows how to download all
   packages to a directory and then install them in a virtual environment
   (a Python sketch of the same pattern also appears after this comment):
   ```bash
   #!/bin/bash -x
   
   # Scratch directories: one for the downloaded packages, one for the virtualenv.
   TMP_PYPI="$(mktemp -d)"
   TMP_VENV="$(mktemp -d)"
   
   virtualenv "${TMP_PYPI}/venv"
   virtualenv "${TMP_VENV}"
   
   # Download apache-beam[gcp] and all of its dependencies into a local folder.
   START_FETCH=$(date +%s)
   source "${TMP_VENV}/bin/activate"
   mkdir -p "${TMP_PYPI}/packages"
   pip download "apache-beam[gcp]" -d "${TMP_PYPI}/packages"
   END_FETCH=$(date +%s)
   
   # Install everything from that local folder into the virtualenv (no network needed).
   START_INSTALL=$(date +%s)
   source "${TMP_VENV}/bin/activate"
   pip install $(find "${TMP_PYPI}/packages" -type f)
   END_INSTALL=$(date +%s)
   
   echo "Fetch time: $(( END_FETCH - START_FETCH )) seconds"
   echo "Install time: $(( END_INSTALL - START_INSTALL )) seconds"
   ```
   For me, the download process took a long time, but the local installation 
was significantly faster.
   ```
   Fetch time: 41 seconds
   Install time: 20 seconds
   ```
   It is worth remembering that just starting the workers takes a few minutes, 
so in this case the differences of a few seconds will not matter much.
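
To make the approach above concrete, here is a rough Python sketch of the same idea: build a throwaway virtual environment per run (as the ``TemporaryDirectory`` + ``prepare_virtualenv`` hunk above does) and install packages from a pre-downloaded local folder instead of the network. It uses only the standard library as a simplified stand-in for ``prepare_virtualenv``; the paths and package list are assumptions, not code from this PR.

```python
# Sketch only: a simplified stand-in for the TemporaryDirectory + prepare_virtualenv
# pattern shown in the diff. Paths and package names are illustrative assumptions.
import os
import subprocess
import tempfile
import venv

# Folder populated beforehand, e.g. with `pip download apache-beam[gcp] -d ...`.
PACKAGES_DIR = "/tmp/pypi/packages"

def make_temp_venv(requirements, system_site_packages=False):
    """Create a throwaway virtualenv and return the path to its Python interpreter."""
    tmp_dir = tempfile.mkdtemp(prefix="dataflow-venv")
    venv.EnvBuilder(with_pip=True, system_site_packages=system_site_packages).create(tmp_dir)
    py_interpreter = os.path.join(tmp_dir, "bin", "python")
    # Install only from the local folder (no index / no network); a private
    # repository could be used instead by exporting PIP_CONFIG_FILE.
    subprocess.check_call(
        [py_interpreter, "-m", "pip", "install",
         "--no-index", "--find-links", PACKAGES_DIR] + list(requirements)
    )
    return py_interpreter

if __name__ == "__main__":
    interpreter = make_temp_venv(["apache-beam[gcp]"])
    print("Beam pipeline would be executed with:", interpreter)
```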


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

