Re: beam main file with dependencies
Marco,

To add to the other answers: there are two ways I add dependencies to my jobs. In both cases, you need a setup.py like this:

    from setuptools import setup, find_packages

    setup(
        name="dependencies",
        version="0.0.1",
        packages=find_packages(),
        install_requires=[
            'pymssql==2.1.4',
            'google-cloud-storage==1.22.0',
        ],
    )

With only this in your setup file, you will be able to add dependencies.

1) Setup file: when you run your job, you pass a --setup_file flag. So it would be like this:

    python main_file.py --runner=dataflow --project=myproject --template_location=gs://mybucket/my_template --temp_location=gs://mybucket/temp --staging_location=gs://mybucket/staging --setup_file home/path/to/setup.py

2) Extra package: from your setup.py you can create a package that you add to your job. To do so, run:

    python setup.py sdist

You then add the file it creates to your job with the --extra_package parameter:

    python main_file.py --runner=dataflow --project=myproject --template_location=gs://mybucket/my_template --temp_location=gs://mybucket/temp --staging_location=gs://mybucket/staging --extra_package dist/dependencies-0.0.1.tar.gz

Good luck!
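A small stdlib-only sketch tying option 2 together: predicting the tarball name that `python setup.py sdist` writes under dist/ from the name= and version= fields in the setup.py above, and assembling the final invocation as an argument list. The helper name is mine for illustration, not a Beam or setuptools API:

```python
# Hypothetical helper: setuptools' sdist command names the tarball
# <name>-<version>.tar.gz and writes it under dist/.
def extra_package_path(name, version):
    return "dist/{}-{}.tar.gz".format(name, version)

# Assemble the full Dataflow invocation for the --extra_package route,
# instead of hard-coding the tarball name into the command line.
cmd = [
    "python", "main_file.py",
    "--runner=dataflow",
    "--project=myproject",
    "--template_location=gs://mybucket/my_template",
    "--temp_location=gs://mybucket/temp",
    "--staging_location=gs://mybucket/staging",
    "--extra_package", extra_package_path("dependencies", "0.0.1"),
]
```

If you bump version= in setup.py, the helper keeps the --extra_package value in sync with the new tarball.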
André Rocha
Data Engineer

On Fri, Jan 17, 2020 at 8:35 AM Chris Swart wrote:
> Hey Marco, you will need to package your application in a module; the
> Juliaset example shows how you could go about it:
> https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset
> Best wishes,
> Chris
>
> On Thu, Jan 16, 2020 at 10:00 PM Marco Mistroni wrote:
> >> [snip]
Re: beam main file with dependencies
Hey Marco, you will need to package your application in a module; the Juliaset example shows how you could go about it:
https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset

Best wishes,
Chris

On Thu, Jan 16, 2020 at 10:00 PM Marco Mistroni wrote:
> [snip]
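As a sketch of what "package your application in a module" means in practice, here is a minimal layout loosely modeled on the Juliaset example (the directory name `myworkflow` is hypothetical):

```shell
# The pipeline code lives in a package directory next to setup.py; the
# package, not the top-level script, is what gets shipped to workers.
mkdir -p myworkflow
touch setup.py main_file.py
touch myworkflow/__init__.py myworkflow/utils.py
# main_file.py would then do `from myworkflow import utils` instead of
# `import utils`, and find_packages() in setup.py picks myworkflow up.
find myworkflow -type f
```

With this layout, the `No module named 'utils'` error goes away because utils.py travels to the workers inside the package rather than living loose next to the entry-point script.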
Re: beam main file with dependencies
Yes, you'll need to bundle up these dependencies in a way that they can be shipped to the workers. See
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/

On Thu, Jan 16, 2020 at 2:00 PM Marco Mistroni wrote:
> [snip]
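That documentation page boils down to passing one of a few staging flags on the command line. Purely to illustrate how those flags sit alongside the rest of the arguments, here is a stdlib-only sketch using argparse; this mirrors, but is not, Beam's real PipelineOptions parsing:

```python
import argparse

# Flags named after the ones on the Beam dependencies page above; in a
# real job, Beam's PipelineOptions consumes them for you.
parser = argparse.ArgumentParser()
parser.add_argument("--setup_file",
                    help="path to setup.py; workers run it at startup")
parser.add_argument("--extra_package", action="append", default=[],
                    help="sdist tarball(s) to stage alongside the job")
parser.add_argument("--requirements_file",
                    help="pip requirements to install on each worker")

# parse_known_args splits the staging flags from everything else, which
# passes through untouched to the rest of the pipeline configuration.
known, pipeline_args = parser.parse_known_args(
    ["--setup_file", "./setup.py", "--runner=dataflow"])
```

The key point is that the staging flags and the runner/project flags travel in the same command line, and the framework sorts out which are which.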
beam main file with dependencies
Hello all,

I have written an Apache Beam workflow which I have split across two files:
- main_file.py contains the pipeline
- utils.py contains a few functions used in the pipeline

I have created a template for this using the command below:

    python -m main_file.py --runner=dataflow --project=myproject --template_location=gs://mybucket/my_template --temp_location=gs://mybucket/temp --staging_location=gs://mybucket/staging

and I have attempted to create a job using this template. However, when I kick off the job I am getting exceptions such as:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/site-packages/apache_beam/internal/pickler.py", line 261, in loads
        return dill.loads(s)
      File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 317, in loads
        return load(file, ignore)
      File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 305, in load
        obj = pik.load()
      File "/usr/local/lib/python3.5/site-packages/dill/_dill.py", line 474, in find_class
        return StockUnpickler.find_class(self, module, name)
    ImportError: No module named 'utils'

I am guessing I am missing some steps in packaging the application, or perhaps some extra options to specify dependencies? I would not imagine writing a whole workflow in one file, so this looks like a standard use case?

Kind regards