Re: Dataflow not able to find a module specified using extra_package

2023-12-19 Thread Sumit Desai via user
Thanks, Anand and Robert. Using extra_packages and specifying it as a list
worked.
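For reference, the working entry in the options dict ended up roughly as
follows (same variable as in the quoted snippet below), i.e. the plural key
with a list of package paths:

    'extra_packages': [uplight_telemetry_tar_file_path],

instead of the original

    'extra_package': uplight_telemetry_tar_file_path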

Regards,
Sumit Desai

On Tue, Dec 19, 2023 at 11:45 PM Robert Bradshaw via user <
user@beam.apache.org> wrote:

> And should it be a list of strings, rather than a string?
>
> On Tue, Dec 19, 2023 at 10:10 AM Anand Inguva via user <
> user@beam.apache.org> wrote:
>
>> Can you try passing `extra_packages` instead of `extra_package` when
>> passing pipeline options as a dict?
>>
>> On Tue, Dec 19, 2023 at 12:26 PM Sumit Desai via user <
>> user@beam.apache.org> wrote:
>>
>>> Hi all,
>>> I have created a Dataflow pipeline in batch mode using the Apache Beam
>>> Python SDK. I am using one non-public dependency, 'uplight-telemetry',
>>> which I have specified via the extra_package parameter while creating the
>>> pipeline_options object. However, the pipeline fails to load with the
>>> error *No module named 'uplight_telemetry'*.
>>> The code that creates pipeline_options is as follows:
>>>
>>> def __create_pipeline_options_dataflow(job_name):
>>>     # Set up the Dataflow runner options
>>>     gcp_project_id = os.environ.get(GCP_PROJECT_ID)
>>>     current_dir = os.path.dirname(os.path.abspath(__file__))
>>>     print("current_dir=", current_dir)
>>>     setup_file_path = os.path.join(current_dir, '..', '..', 'setup.py')
>>>     print("Set-up file path=", setup_file_path)
>>>     # TODO: Move file to proper location
>>>     uplight_telemetry_tar_file_path = os.path.join(
>>>         current_dir, '..', '..', '..', 'non-public-dependencies',
>>>         'uplight-telemetry-1.0.0.tar.gz')
>>>     # TODO: Move to environmental variables
>>>     pipeline_options = {
>>>         'project': gcp_project_id,
>>>         'region': "us-east1",
>>>         'job_name': job_name,  # Provide a unique job name
>>>         'temp_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/temp',
>>>         'staging_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/staging',
>>>         'runner': 'DataflowRunner',
>>>         'save_main_session': True,
>>>         'service_account_email': os.environ.get(SERVICE_ACCOUNT),
>>>         # 'network': f'projects/{gcp_project_id}/global/networks/default',
>>>         'subnetwork': os.environ.get(SUBNETWORK_URL),
>>>         'setup_file': setup_file_path,
>>>         'extra_package': uplight_telemetry_tar_file_path
>>>         # 'template_location': 'gcr.io/dataflow-templates-base/python310-template-launcher-base'
>>>     }
>>>     print("Pipeline created for job-name", job_name)
>>>     logger.debug(f"pipeline_options created as {pipeline_options}")
>>>     return pipeline_options
>>>
>>> Why is it not trying to install this package from extra_package?
>>>
>>


Re: Dataflow not able to find a module specified using extra_package

2023-12-19 Thread Robert Bradshaw via user
And should it be a list of strings, rather than a string?

On Tue, Dec 19, 2023 at 10:10 AM Anand Inguva via user 
wrote:

> Can you try passing `extra_packages` instead of `extra_package` when
> passing pipeline options as a dict?
>
> On Tue, Dec 19, 2023 at 12:26 PM Sumit Desai via user <
> user@beam.apache.org> wrote:
>
>> Hi all,
>> I have created a Dataflow pipeline in batch mode using the Apache Beam
>> Python SDK. I am using one non-public dependency, 'uplight-telemetry',
>> which I have specified via the extra_package parameter while creating the
>> pipeline_options object. However, the pipeline fails to load with the
>> error *No module named 'uplight_telemetry'*.
>> The code that creates pipeline_options is as follows:
>>
>> def __create_pipeline_options_dataflow(job_name):
>>     # Set up the Dataflow runner options
>>     gcp_project_id = os.environ.get(GCP_PROJECT_ID)
>>     current_dir = os.path.dirname(os.path.abspath(__file__))
>>     print("current_dir=", current_dir)
>>     setup_file_path = os.path.join(current_dir, '..', '..', 'setup.py')
>>     print("Set-up file path=", setup_file_path)
>>     # TODO: Move file to proper location
>>     uplight_telemetry_tar_file_path = os.path.join(
>>         current_dir, '..', '..', '..', 'non-public-dependencies',
>>         'uplight-telemetry-1.0.0.tar.gz')
>>     # TODO: Move to environmental variables
>>     pipeline_options = {
>>         'project': gcp_project_id,
>>         'region': "us-east1",
>>         'job_name': job_name,  # Provide a unique job name
>>         'temp_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/temp',
>>         'staging_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/staging',
>>         'runner': 'DataflowRunner',
>>         'save_main_session': True,
>>         'service_account_email': os.environ.get(SERVICE_ACCOUNT),
>>         # 'network': f'projects/{gcp_project_id}/global/networks/default',
>>         'subnetwork': os.environ.get(SUBNETWORK_URL),
>>         'setup_file': setup_file_path,
>>         'extra_package': uplight_telemetry_tar_file_path
>>         # 'template_location': 'gcr.io/dataflow-templates-base/python310-template-launcher-base'
>>     }
>>     print("Pipeline created for job-name", job_name)
>>     logger.debug(f"pipeline_options created as {pipeline_options}")
>>     return pipeline_options
>>
>> Why is it not trying to install this package from extra_package?
>>
>


Re: Dataflow not able to find a module specified using extra_package

2023-12-19 Thread Anand Inguva via user
Can you try passing `extra_packages` instead of `extra_package` when
passing pipeline options as a dict?
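Roughly, and assuming the dict is then converted with something like
PipelineOptions.from_dictionary (the exact wiring in your code may differ),
a sketch of the change would be:

    from apache_beam.options.pipeline_options import PipelineOptions

    job_name = 'uplight-telemetry-batch'  # hypothetical job name
    options_dict = __create_pipeline_options_dataflow(job_name)
    # Plural key, and a list of package paths rather than a single string.
    options_dict['extra_packages'] = [options_dict.pop('extra_package')]
    pipeline_options = PipelineOptions.from_dictionary(options_dict)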

On Tue, Dec 19, 2023 at 12:26 PM Sumit Desai via user 
wrote:

> Hi all,
> I have created a Dataflow pipeline in batch mode using the Apache Beam
> Python SDK. I am using one non-public dependency, 'uplight-telemetry',
> which I have specified via the extra_package parameter while creating the
> pipeline_options object. However, the pipeline fails to load with the
> error *No module named 'uplight_telemetry'*.
> The code that creates pipeline_options is as follows:
>
> def __create_pipeline_options_dataflow(job_name):
>     # Set up the Dataflow runner options
>     gcp_project_id = os.environ.get(GCP_PROJECT_ID)
>     current_dir = os.path.dirname(os.path.abspath(__file__))
>     print("current_dir=", current_dir)
>     setup_file_path = os.path.join(current_dir, '..', '..', 'setup.py')
>     print("Set-up file path=", setup_file_path)
>     # TODO: Move file to proper location
>     uplight_telemetry_tar_file_path = os.path.join(
>         current_dir, '..', '..', '..', 'non-public-dependencies',
>         'uplight-telemetry-1.0.0.tar.gz')
>     # TODO: Move to environmental variables
>     pipeline_options = {
>         'project': gcp_project_id,
>         'region': "us-east1",
>         'job_name': job_name,  # Provide a unique job name
>         'temp_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/temp',
>         'staging_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/staging',
>         'runner': 'DataflowRunner',
>         'save_main_session': True,
>         'service_account_email': os.environ.get(SERVICE_ACCOUNT),
>         # 'network': f'projects/{gcp_project_id}/global/networks/default',
>         'subnetwork': os.environ.get(SUBNETWORK_URL),
>         'setup_file': setup_file_path,
>         'extra_package': uplight_telemetry_tar_file_path
>         # 'template_location': 'gcr.io/dataflow-templates-base/python310-template-launcher-base'
>     }
>     print("Pipeline created for job-name", job_name)
>     logger.debug(f"pipeline_options created as {pipeline_options}")
>     return pipeline_options
>
> Why is it not trying to install this package from extra_package?
>


Dataflow not able to find a module specified using extra_package

2023-12-19 Thread Sumit Desai via user
Hi all,
I have created a Dataflow pipeline in batch mode using the Apache Beam Python
SDK. I am using one non-public dependency, 'uplight-telemetry', which I have
specified via the extra_package parameter while creating the pipeline_options
object. However, the pipeline fails to load with the error *No module named
'uplight_telemetry'*.
The code that creates pipeline_options is as follows:

def __create_pipeline_options_dataflow(job_name):
    # Set up the Dataflow runner options
    gcp_project_id = os.environ.get(GCP_PROJECT_ID)
    current_dir = os.path.dirname(os.path.abspath(__file__))
    print("current_dir=", current_dir)
    setup_file_path = os.path.join(current_dir, '..', '..', 'setup.py')
    print("Set-up file path=", setup_file_path)
    # TODO: Move file to proper location
    uplight_telemetry_tar_file_path = os.path.join(
        current_dir, '..', '..', '..', 'non-public-dependencies',
        'uplight-telemetry-1.0.0.tar.gz')
    # TODO: Move to environmental variables
    pipeline_options = {
        'project': gcp_project_id,
        'region': "us-east1",
        'job_name': job_name,  # Provide a unique job name
        'temp_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/temp',
        'staging_location': f'gs://{TAS_GCS_BUCKET_NAME_PREFIX}{os.getenv("UP_PLATFORM_ENV")}/staging',
        'runner': 'DataflowRunner',
        'save_main_session': True,
        'service_account_email': os.environ.get(SERVICE_ACCOUNT),
        # 'network': f'projects/{gcp_project_id}/global/networks/default',
        'subnetwork': os.environ.get(SUBNETWORK_URL),
        'setup_file': setup_file_path,
        'extra_package': uplight_telemetry_tar_file_path
        # 'template_location': 'gcr.io/dataflow-templates-base/python310-template-launcher-base'
    }
    print("Pipeline created for job-name", job_name)
    logger.debug(f"pipeline_options created as {pipeline_options}")
    return pipeline_options

Why is it not trying to install this package from extra_package?