I assume from the previous messages that GCP Dataflow is being used as the
pipeline runner.  Even without Flex Templates, Dataflow Runner v2 can use
custom Docker containers to install dependencies from various sources [1].
I have used Docker containers to solve the same problem you mention:
installing a Python dependency from a private package repository.  The
process is roughly:


   1. Build a Docker image from the Apache Beam base image, customizing it
   as you need [2]
   2. Tag and push that image to Google Container Registry
   3. When you deploy your Dataflow job, include the options
   "--experiments=use_runner_v2
   --worker_harness_container_image=gcr.io/my-project/my-image-name:my-image-tag"
   (there may be other ways, but this is what I have seen working
   first-hand; a Python sketch of these options is below)
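
If you launch the pipeline from Python (as in your snippet further down the
thread) rather than from the command line, the same settings can be passed
through pipeline options.  A minimal sketch, assuming placeholder project,
bucket, and image names (not from this thread):

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-east1",
    temp_location="gs://my-bucket/temp",
    # Runner v2 plus the custom worker image built in steps 1 and 2
    experiments=["use_runner_v2"],
    # newer SDKs also accept sdk_container_image for the same setting
    worker_harness_container_image="gcr.io/my-project/my-image-name:my-image-tag",
)

These keyword arguments map to the same command-line flags shown above.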

Your Dockerfile can be as simple as:

# The python:<major>.<minor>-slim base must match the apache/beam_python<major>.<minor>_sdk image below
FROM python:3.10-slim

# Authenticate with your private Python package repo, install your various
# dependencies, set env vars, COPY your pipeline code into the container, etc.
#
#  ...
#

# Copy files from the official SDK image, including the boot script and dependencies.
# The Apache Beam SDK version must match the Python base image's major.minor version.
# Based on
# https://cloud.google.com/dataflow/docs/guides/using-custom-containers#python_1
COPY --from=apache/beam_python3.10_sdk:2.52.0 /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
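
Since the custom image defines the worker environment, the customization
section above is also where you could bake in environment variables such as
OTEL_SERVICE_NAME from your earlier message (for example with an ENV
instruction), assuming their values are known at image build time.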

[1] https://cloud.google.com/dataflow/docs/guides/using-custom-containers#python_1
[2] https://cloud.google.com/dataflow/docs/guides/build-container-image#python


On Fri, Dec 22, 2023 at 6:32 AM XQ Hu via user <user@beam.apache.org> wrote:

> You can use the same docker image for both template launcher and Dataflow
> job. Here is one example:
> https://github.com/google/dataflow-ml-starter/blob/main/tensorflow_gpu.flex.Dockerfile#L60
>
> On Fri, Dec 22, 2023 at 8:04 AM Sumit Desai <sumit.de...@uplight.com>
> wrote:
>
>> Yes, I will have to try it out.
>>
>> Regards
>> Sumit Desai
>>
>> On Fri, Dec 22, 2023 at 3:53 PM Sofia’s World <mmistr...@gmail.com>
>> wrote:
>>
>>> I guess so, i am not an expert on using env variables in dataflow
>>> pipelines as any config dependencies i  need, i pass them as job input
>>> params
>>>
>>> But perhaps you can configure variables in your docker file (i am not an
>>> expert in this either),  as  flex templates use Docker?
>>>
>>>
>>> https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates
>>>
>>> hth
>>>   Marco
>>>
>>>
>>>
>>>
>>> On Fri, Dec 22, 2023 at 10:17 AM Sumit Desai <sumit.de...@uplight.com>
>>> wrote:
>>>
>>>> We are using an external non-public package which expects environmental
>>>> variables only. If environmental variables are not found, it will throw an
>>>> error. We can't change source of this package.
>>>>
>>>> Does this mean we will face same problem with flex templates also?
>>>>
>>>> On Fri, 22 Dec 2023, 3:39 pm Sofia’s World, <mmistr...@gmail.com>
>>>> wrote:
>>>>
>>>>> The flex template will allow you to pass input params with dynamic
>>>>> values to your data flow job so you could replace the env variable with
>>>>> that input? That is, unless you have to have env bars..but from your
>>>>> snippets it appears you are just using them to configure one of your
>>>>> components?
>>>>> Hth
>>>>>
>>>>> On Fri, 22 Dec 2023, 10:01 Sumit Desai, <sumit.de...@uplight.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Sofia and XQ,
>>>>>>
>>>>>> The application is failing because I have loggers defined in every
>>>>>> file and the method to create a logger tries to create an object of
>>>>>> UplightTelemetry. If I use flex templated, will the environmental 
>>>>>> variables
>>>>>> I supply be loaded before the application gets loaded? If not, it would 
>>>>>> not
>>>>>> serve my purpose.
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Sumit Desai
>>>>>>
>>>>>> On Thu, Dec 21, 2023 at 10:02 AM Sumit Desai <sumit.de...@uplight.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thank you HQ. Will take a look at this.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Sumit Desai
>>>>>>>
>>>>>>> On Wed, Dec 20, 2023 at 8:13 PM XQ Hu <x...@google.com> wrote:
>>>>>>>
>>>>>>>> Dataflow VMs cannot know your local env variable. I think you
>>>>>>>> should use custom container:
>>>>>>>> https://cloud.google.com/dataflow/docs/guides/using-custom-containers.
>>>>>>>> Here is a sample project:
>>>>>>>> https://github.com/google/dataflow-ml-starter
>>>>>>>>
>>>>>>>> On Wed, Dec 20, 2023 at 4:57 AM Sofia’s World <mmistr...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello Sumit
>>>>>>>>>  Thanks. Sorry...I guess if the value of the env variable is
>>>>>>>>> always the same u can pass it as job params?..though it doesn't sound 
>>>>>>>>> like
>>>>>>>>> a viable option...
>>>>>>>>> Hth
>>>>>>>>>
>>>>>>>>> On Wed, 20 Dec 2023, 09:49 Sumit Desai, <sumit.de...@uplight.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Sofia,
>>>>>>>>>>
>>>>>>>>>> Thanks for the response. For now, we have decided not to use flex
>>>>>>>>>> template. Is there a way to pass environmental variables without 
>>>>>>>>>> using any
>>>>>>>>>> template?
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Sumit Desai
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 20, 2023 at 3:16 PM Sofia’s World <
>>>>>>>>>> mmistr...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi
>>>>>>>>>>>  My 2 cents. .have u ever considered using flex templates to run
>>>>>>>>>>> your pipeline? Then you can pass all your parameters at runtime..
>>>>>>>>>>> (Apologies in advance if it does not cover your use case...)
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 20 Dec 2023, 09:35 Sumit Desai via user, <
>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I have a Python application which is using Apache beam and
>>>>>>>>>>>> Dataflow as runner. The application uses a non-public Python 
>>>>>>>>>>>> package
>>>>>>>>>>>> 'uplight-telemetry' which is configured using 'extra_packages' 
>>>>>>>>>>>> while
>>>>>>>>>>>> creating pipeline_options object. This package expects an 
>>>>>>>>>>>> environmental
>>>>>>>>>>>> variable named 'OTEL_SERVICE_NAME' and since this variable is not 
>>>>>>>>>>>> present
>>>>>>>>>>>> in the Dataflow worker, it is resulting in an error during 
>>>>>>>>>>>> application
>>>>>>>>>>>> startup.
>>>>>>>>>>>>
>>>>>>>>>>>> I am passing this variable using custom pipeline options. Code
>>>>>>>>>>>> to create pipeline options is as follows-
>>>>>>>>>>>>
>>>>>>>>>>>> pipeline_options = ProcessBillRequests.CustomOptions(
>>>>>>>>>>>>     project=gcp_project_id,
>>>>>>>>>>>>     region="us-east1",
>>>>>>>>>>>>     job_name=job_name,
>>>>>>>>>>>>     
>>>>>>>>>>>> temp_location=f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/temp',
>>>>>>>>>>>>     
>>>>>>>>>>>> staging_location=f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/staging',
>>>>>>>>>>>>     runner='DataflowRunner',
>>>>>>>>>>>>     save_main_session=True,
>>>>>>>>>>>>     service_account_email= service_account,
>>>>>>>>>>>>     subnetwork=os.environ.get(SUBNETWORK_URL),
>>>>>>>>>>>>     extra_packages=[uplight_telemetry_tar_file_path],
>>>>>>>>>>>>     setup_file=setup_file_path,
>>>>>>>>>>>>     OTEL_SERVICE_NAME=otel_service_name,
>>>>>>>>>>>>     OTEL_RESOURCE_ATTRIBUTES=otel_resource_attributes
>>>>>>>>>>>>     # Set values for additional custom variables as needed
>>>>>>>>>>>> )
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> And the code that executes the pipeline is as follows-
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> result = (
>>>>>>>>>>>>         pipeline
>>>>>>>>>>>>         | "ReadPendingRecordsFromDB" >> read_from_db
>>>>>>>>>>>>         | "Parse input PCollection" >> 
>>>>>>>>>>>> beam.Map(ProcessBillRequests.parse_bill_data_requests)
>>>>>>>>>>>>         | "Fetch bills " >> 
>>>>>>>>>>>> beam.ParDo(ProcessBillRequests.FetchBillInformation())
>>>>>>>>>>>> )
>>>>>>>>>>>>
>>>>>>>>>>>> pipeline.run().wait_until_finish()
>>>>>>>>>>>>
>>>>>>>>>>>> Is there a way I can set the environmental variables in custom
>>>>>>>>>>>> options available in the worker?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Sumit Desai
>>>>>>>>>>>>
>>>>>>>>>>>
