Re: Specifying dataflow template location with Apache beam Python SDK

2023-12-18 Thread Sumit Desai via user
Thanks all. Yes, I was under the misunderstanding that we could directly use
one of these templates as a base without creating a custom template. Thanks
for clarifying it for me.

Regards,
Sumit Desai


Re: Specifying dataflow template location with Apache beam Python SDK

2023-12-18 Thread Bruno Volpato via user
Right, there's some misunderstanding here; Bartosz's and XQ's inputs are
correct.

Just want to add that the template_location parameter is the GCS path where
you want to store your template, not the image reference of the base image.
The GCR path you are trying to use belongs in the Dockerfile, in case you are
building a Flex Template (see here:
https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_custom_container_images
).
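
As a rough sketch (the project and bucket names below are placeholders, not
from this thread), this is how template_location is normally used with the
Python SDK: it points at a gs:// path so that running the pipeline stages a
classic template there instead of submitting a job:

    from apache_beam.options.pipeline_options import PipelineOptions

    # With template_location set, running the pipeline writes a classic
    # template spec to the given GCS path rather than launching a Dataflow job.
    options = PipelineOptions.from_dictionary({
        'project': 'my-project',                 # placeholder project id
        'region': 'us-east1',
        'runner': 'DataflowRunner',
        'temp_location': 'gs://my-bucket/temp',  # placeholder bucket
        'staging_location': 'gs://my-bucket/staging',
        'template_location': 'gs://my-bucket/templates/my_pipeline',  # GCS path, not a GCR image
    })

The staged template can then be launched later from that gs:// path with
`gcloud dataflow jobs run ...` or from the console.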

Best,
Bruno





Re: Specifying dataflow template location with Apache beam Python SDK

2023-12-18 Thread XQ Hu via user
https://github.com/google/dataflow-ml-starter/tree/main?tab=readme-ov-file#run-the-beam-pipeline-with-dataflow-flex-templates
has a full example of how to create your own Flex Template. FYI.


Re: Specifying dataflow template location with Apache beam Python SDK

2023-12-18 Thread Bartosz Zabłocki via user
Hi Sumit,
could you elaborate a little bit more on what you are trying to achieve
with the templates?

As far as I know, these Docker images serve as base images for your own
custom templates.
If you want to use an existing template, you can use one of these:
https://cloud.google.com/dataflow/docs/guides/templates/provided-templates.
To run it, you just need to invoke `gcloud dataflow jobs run ...` or an
equivalent command (
https://cloud.google.com/dataflow/docs/guides/templates/provided/pubsub-to-pubsub#gcloud).
Or just use the UI to launch it (Cloud Console -> Dataflow -> Jobs ->
Create Job From Template).
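For illustration, here is a rough Python sketch (not from this thread) of
launching one of those provided templates through the Dataflow REST API with
google-api-python-client; the project, bucket, and parameter names below are
placeholders, and the exact parameter names depend on the template you pick:

    from googleapiclient.discovery import build

    def launch_provided_template(project_id, region, job_name, gcs_template_path, parameters):
        # Discovery-based client for the Dataflow REST API (v1b3);
        # authenticates with Application Default Credentials.
        dataflow = build('dataflow', 'v1b3')
        request = dataflow.projects().locations().templates().launch(
            projectId=project_id,
            location=region,
            gcsPath=gcs_template_path,   # gs:// path of the template spec
            body={'jobName': job_name, 'parameters': parameters},
        )
        return request.execute()

    # Example: the provided Word Count template (parameter names illustrative).
    response = launch_provided_template(
        project_id='my-project',
        region='us-east1',
        job_name='word-count-from-template',
        gcs_template_path='gs://dataflow-templates/latest/Word_Count',
        parameters={'inputFile': 'gs://my-bucket/input.txt',
                    'output': 'gs://my-bucket/output/result'},
    )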

If you want to create your own template (i.e. a reusable Dataflow pipeline),
take a look at this page:
https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#create_a_flex_template.
This will let you package your own pipeline as a template. You'll be able
to launch it with the `gcloud dataflow flex-template run ...` command.
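To make that concrete, here is a minimal, hypothetical pipeline script of the
kind you would package as a template (the WordFilterOptions class and the
--input/--output parameters are invented for this example):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class WordFilterOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # Custom options like these become the template's parameters.
            parser.add_argument('--input', required=True, help='Input text file(s) on GCS')
            parser.add_argument('--output', required=True, help='Output path prefix on GCS')

    def run():
        options = WordFilterOptions()
        with beam.Pipeline(options=options) as pipeline:
            (pipeline
             | 'Read' >> beam.io.ReadFromText(options.input)
             | 'KeepLongWords' >> beam.FlatMap(
                   lambda line: (w for w in line.split() if len(w) > 5))
             | 'Write' >> beam.io.WriteToText(options.output))

    if __name__ == '__main__':
        run()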
If you want more control over the environment and dependencies, you can
create your own custom Docker image. That's where you'll use the base image
you mentioned. See this page for an example:
https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_a_custom_container_for_dependencies.

I hope this helps, let me know if you have any other questions.

Cheers,
Bartosz Zablocki


Specifying dataflow template location with Apache beam Python SDK

2023-12-17 Thread Sumit Desai via user
I am creating an Apache Beam pipeline using the Python SDK. I want to use one
of the standard Dataflow templates (this one).
But when I specify it using the 'template_location' key while creating the
pipeline_options object, I get an error: `FileNotFoundError: [Errno 2] No such
file or directory:
'gcr.io/dataflow-templates-base/python310-template-launcher-base'`

I also tried to specify the complete version
`gcr.io/dataflow-templates-base/python310-template-launcher-base::flex_templates_base_image_release_20231127_RC00`
but got the same error. Can someone suggest what I might be doing wrong?
The code snippet that creates pipeline_options is as follows:

def __create_pipeline_options_dataflow(job_name):
    # Set up the Dataflow runner options
    gcp_project_id = os.environ.get(GCP_PROJECT_ID)
    # TODO: Move to environmental variables
    pipeline_options = {
        'project': gcp_project_id,
        'region': "us-east1",
        'job_name': job_name,  # Provide a unique job name
        'temp_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/temp',
        'staging_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/staging',
        'runner': 'DataflowRunner',
        'save_main_session': True,
        'service_account_email': service_account,
        # 'network': f'projects/{gcp_project_id}/global/networks/default',
        # 'subnetwork': f'projects/{gcp_project_id}/regions/us-east1/subnetworks/default',
        'template_location': 'gcr.io/dataflow-templates-base/python310-template-launcher-base'
    }
    logger.debug(f"pipeline_options created as {pipeline_options}")
    return pipeline_options