Re: beam rebuilds numpy on pipeline run
On Fri, 2020-10-09 at 19:10 +, Ross Vandegrift wrote:
> Starting today, running a beam pipeline triggers a large reinstallation of
> python modules.  For some reason, it forces full rebuilds from source -
> since beam depends on numpy, this takes a long time.

I opened a support ticket with Google and got a workaround: move all of the
dependencies from requirements.txt to the setuptools invocation in setup.py.
No more numpy rebuilds.

Ross
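A minimal sketch of what that workaround can look like, assuming the two
dependencies from the requirements.txt shown below; the package name and
version are placeholders, not taken from the original setup.py:

  # setup.py -- dependencies declared via setuptools instead of requirements.txt
  import setuptools

  setuptools.setup(
      name='my-beam-pipeline',          # placeholder name
      version='0.1.0',                  # placeholder version
      packages=setuptools.find_packages(),
      install_requires=[
          'apache-beam[gcp]==2.23.0',
          'boto3==1.15.0',
      ],
  )

With the dependencies declared here (and the job pointed at setup.py via the
setup_file pipeline option rather than a requirements_file), the Dataflow
runner no longer needs to run `pip download ... --no-binary :all:`, so numpy
isn't rebuilt from source.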
beam rebuilds numpy on pipeline run
Hello,

Starting today, running a beam pipeline triggers a large reinstallation of
python modules.  For some reason, it forces full rebuilds from source - since
beam depends on numpy, this takes a long time.

There's nothing strange about my python setup.  I'm using python3.7 on debian
buster with the dataflow runner.  My venv is set up like this:

  python3 -m venv ~/.venvs/beam
  . ~/.venvs/beam/bin/activate
  python3 -m pip install --upgrade wheel
  python3 -m pip install --upgrade pip setuptools
  python3 -m pip install -r requirements.txt

My requirements.txt has:

  apache-beam[gcp]==2.23.0
  boto3==1.15.0

When it's building, `ps ax | grep python` shows me this:

  /home/ross/.venvs/beam/bin/python -m pip download --dest /tmp/dataflow-requirements-cache -r requirements.txt --exists-action i --no-binary :all:

How do I prevent this?  It's far too slow to develop with, and our compliance
folks are likely to prohibit a tool that silently downloads & builds unknown
code.

Ross
Re: Provide credentials for s3 writes
I've worked through adapting this to Dataflow, it's simple enough once you try
all of the things that don't work. :)

In setup.py, write out config files with an identity token and a boto3 config
file.  File-based config was essential, I couldn't get env vars working.

Here's a sample.  Be careful!  This can clobber your local boto3 config.  All
of this is in the top-level scope of setup.py:

  import pathlib
  import os

  import google.oauth2.id_token
  import google.auth.transport.requests

  # Get google id token
  request = google.auth.transport.requests.Request()
  id_token = google.oauth2.id_token.fetch_id_token(request, 'your-audience')
  with open('/tmp/id_token', 'w') as f:
      f.write(id_token)

  # create aws sdk config
  home = os.getenv('HOME', '/tmp')
  dotaws = pathlib.Path(home) / pathlib.Path('.aws')
  try:
      dotaws.mkdir()
  except FileExistsError:
      pass

  awsconfig = dotaws / pathlib.Path('config')
  if awsconfig.exists():
      cfgbackup = awsconfig.parent / pathlib.Path('config.bak')
      awsconfig.rename(cfgbackup)

  with awsconfig.open('w') as f:
      f.write('[profile default]\n')
      f.write('role_arn = your-role-arn\n')
      f.write('web_identity_token_file = /tmp/id_token\n')

You need to sub in appropriate values for 'your-audience' and 'your-role-arn'.

Ross

On Thu, 2020-10-01 at 15:47 +, Ross Vandegrift wrote:
> Can you explain that a little bit?  Right now, our pipeline code is
> structured like this:
>
>   if __name__ == '__main__':
>       setup_credentials()  # exports env vars for default boto session
>       run_pipeline()       # runs all the beam stuff
>
> So I expect every worker to set up their environment before running any
> beam code.  This seems to work fine.  Is there an issue lurking here?
>
> Ross
>
> On Wed, 2020-09-30 at 17:57 -0700, Pablo Estrada wrote:
> > You may need to set those up in setup.py so that the code runs in every
> > worker at startup.
> >
> > On Wed, Sep 30, 2020, 10:16 AM Ross Vandegrift <
> > ross.vandegr...@cleardata.com> wrote:
> > > I see - it'd be great if the s3 io code would accept a boto session,
> > > so the default process could be overridden.
> > >
> > > But it looks like the module lazy loads boto3 and uses the default
> > > session.  So I think it'll work if we set up SDK env vars before the
> > > pipeline code.
> > >
> > > i.e., we'll try something like:
> > >
> > >   os.environ['AWS_ROLE_ARN'] = 'arn:aws:...'
> > >   os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
> > >   os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'
> > >
> > >   with beam.Pipeline(...) as p:
> > >       ...
> > >
> > > Ross
> > >
> > > On Tue, 2020-09-29 at 14:29 -0700, Pablo Estrada wrote:
> > > > Hi Ross,
> > > > it seems that this feature is missing (e.g. passing a pipeline
> > > > option with authentication information for AWS).  I'm sorry about
> > > > that - that's pretty annoying.
> > > > I wonder if you can use the setup.py file to add the default
> > > > configuration yourself while we have appropriate support for a
> > > > pipeline option-based authentication.  Could you try adding this
> > > > default config on setup.py?
> > > > Best
> > > > -P.
> > > >
> > > > On Tue, Sep 29, 2020 at 11:16 AM Ross Vandegrift <
> > > > ross.vandegr...@cleardata.com> wrote:
> > > > > Hello all,
> > > > >
> > > > > I have a python pipeline that writes data to an s3 bucket.  On my
> > > > > laptop it picks up the SDK credentials from my boto3 config and
> > > > > works great.
> > > > >
> > > > > Is it possible to provide credentials explicitly?  I'd like to use
> > > > > remote dataflow runners, which won't have implicit AWS credentials
> > > > > available.
> > > > >
> > > > > Thanks,
> > > > > Ross
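A minimal sketch, not from the thread, of how a job using the setup.py above
might be launched so that Dataflow workers execute it at startup; the project,
region, bucket names, and output path are placeholders:

  # run_pipeline.py -- launch with the setup_file option so workers run setup.py
  import apache_beam as beam
  from apache_beam.options.pipeline_options import PipelineOptions

  options = PipelineOptions(
      runner='DataflowRunner',
      project='your-gcp-project',            # placeholder
      region='us-central1',                  # placeholder
      temp_location='gs://your-bucket/tmp',  # placeholder
      setup_file='./setup.py',               # workers execute this at startup
  )

  with beam.Pipeline(options=options) as p:
      (p
       | beam.Create(['hello', 'world'])
       | beam.io.WriteToText('s3://your-bucket/output'))  # placeholder bucket

Since the s3 io lazy loads boto3 and uses the default session, the role_arn
and web_identity_token_file written out by setup.py should be picked up
automatically on the workers.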
Re: Provide credentials for s3 writes
Can you explain that a little bit?  Right now, our pipeline code is structured
like this:

  if __name__ == '__main__':
      setup_credentials()  # exports env vars for default boto session
      run_pipeline()       # runs all the beam stuff

So I expect every worker to set up their environment before running any beam
code.  This seems to work fine.  Is there an issue lurking here?

Ross

On Wed, 2020-09-30 at 17:57 -0700, Pablo Estrada wrote:
> You may need to set those up in setup.py so that the code runs in every
> worker at startup.
>
> On Wed, Sep 30, 2020, 10:16 AM Ross Vandegrift <
> ross.vandegr...@cleardata.com> wrote:
> > I see - it'd be great if the s3 io code would accept a boto session, so
> > the default process could be overridden.
> >
> > But it looks like the module lazy loads boto3 and uses the default
> > session.  So I think it'll work if we set up SDK env vars before the
> > pipeline code.
> >
> > i.e., we'll try something like:
> >
> >   os.environ['AWS_ROLE_ARN'] = 'arn:aws:...'
> >   os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
> >   os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'
> >
> >   with beam.Pipeline(...) as p:
> >       ...
> >
> > Ross
> >
> > On Tue, 2020-09-29 at 14:29 -0700, Pablo Estrada wrote:
> > > Hi Ross,
> > > it seems that this feature is missing (e.g. passing a pipeline option
> > > with authentication information for AWS).  I'm sorry about that -
> > > that's pretty annoying.
> > > I wonder if you can use the setup.py file to add the default
> > > configuration yourself while we have appropriate support for a
> > > pipeline option-based authentication.  Could you try adding this
> > > default config on setup.py?
> > > Best
> > > -P.
> > >
> > > On Tue, Sep 29, 2020 at 11:16 AM Ross Vandegrift <
> > > ross.vandegr...@cleardata.com> wrote:
> > > > Hello all,
> > > >
> > > > I have a python pipeline that writes data to an s3 bucket.  On my
> > > > laptop it picks up the SDK credentials from my boto3 config and
> > > > works great.
> > > >
> > > > Is it possible to provide credentials explicitly?  I'd like to use
> > > > remote dataflow runners, which won't have implicit AWS credentials
> > > > available.
> > > >
> > > > Thanks,
> > > > Ross
Re: Provide credentials for s3 writes
I see - it'd be great if the s3 io code would accept a boto session, so the
default process could be overridden.

But it looks like the module lazy loads boto3 and uses the default session.
So I think it'll work if we set up SDK env vars before the pipeline code.

i.e., we'll try something like:

  os.environ['AWS_ROLE_ARN'] = 'arn:aws:...'
  os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
  os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'

  with beam.Pipeline(...) as p:
      ...

Ross

On Tue, 2020-09-29 at 14:29 -0700, Pablo Estrada wrote:
> Hi Ross,
> it seems that this feature is missing (e.g. passing a pipeline option with
> authentication information for AWS).  I'm sorry about that - that's pretty
> annoying.
> I wonder if you can use the setup.py file to add the default configuration
> yourself while we have appropriate support for a pipeline option-based
> authentication.  Could you try adding this default config on setup.py?
> Best
> -P.
>
> On Tue, Sep 29, 2020 at 11:16 AM Ross Vandegrift <
> ross.vandegr...@cleardata.com> wrote:
> > Hello all,
> >
> > I have a python pipeline that writes data to an s3 bucket.  On my laptop
> > it picks up the SDK credentials from my boto3 config and works great.
> >
> > Is it possible to provide credentials explicitly?  I'd like to use remote
> > dataflow runners, which won't have implicit AWS credentials available.
> >
> > Thanks,
> > Ross
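A minimal sketch, not from the thread, of how boto3's default credential chain
uses these web-identity env vars; the role ARN, token path, and the quick
list_buckets check are placeholders for local verification only (later in the
thread, file-based config turned out to be necessary on Dataflow workers):

  # sketch: botocore's default provider chain supports AssumeRoleWithWebIdentity
  # via AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE, so clients created afterwards
  # use the assumed role.
  import os
  import boto3

  os.environ['AWS_ROLE_ARN'] = 'arn:aws:iam::123456789012:role/my-role'  # placeholder
  os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
  os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'        # placeholder

  s3 = boto3.client('s3')              # default session reads the env vars above
  print(s3.list_buckets()['Buckets'])  # quick check that the role was assumed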