Re: beam rebuilds numpy on pipeline run

2020-10-15 Thread Ross Vandegrift
On Fri, 2020-10-09 at 19:10 +, Ross Vandegrift wrote:
> Starting today, running a beam pipeline triggers a large reinstallation of
> python modules.  For some reason, it forces full rebuilds from source -
> since beam depends on numpy, this takes a long time.

I opened a support ticket with Google and got a workaround: move all of the
dependencies from requirements.txt to the setuptools invocation in setup.py. 
No more numpy rebuilds.
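For reference, a minimal setup.py along those lines might look like the
following sketch - the package name and version are illustrative placeholders;
the pins are the ones from the requirements.txt quoted below:

```python
import setuptools

setuptools.setup(
    name='my-pipeline',   # illustrative package name
    version='0.1.0',      # illustrative version
    packages=setuptools.find_packages(),
    # Declaring the dependencies here, instead of in requirements.txt,
    # is what avoided the source rebuilds when the job was staged.
    install_requires=[
        'apache-beam[gcp]==2.23.0',
        'boto3==1.15.0',
    ],
)
```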

Ross


beam rebuilds numpy on pipeline run

2020-10-09 Thread Ross Vandegrift
Hello,

Starting today, running a beam pipeline triggers a large reinstallation of
python modules.  For some reason, it forces full rebuilds from source - since
beam depends on numpy, this takes a long time.

There's nothing strange about my python setup.  I'm using python3.7 on debian
buster with the dataflow runner.  My venv is set up like this:
 python3 -m venv ~/.venvs/beam
 . ~/.venvs/beam/bin/activate
 python3 -m pip install --upgrade wheel
 python3 -m pip install --upgrade pip setuptools
 python3 -m pip install -r requirements.txt

My requirements.txt has:
  apache-beam[gcp]==2.23.0
  boto3==1.15.0

When it's building, `ps ax | grep python` shows me this:
  /home/ross/.venvs/beam/bin/python -m pip download \
--dest /tmp/dataflow-requirements-cache -r requirements.txt \
--exists-action i --no-binary :all:
How do I prevent this?  It's far too slow to develop with, and our compliance
folks are likely to prohibit a tool that silently downloads & builds unknown
code.

Ross


Re: Provide credentials for s3 writes

2020-10-08 Thread Ross Vandegrift
I've worked through adapting this to Dataflow, it's simple enough once you try
all of the things that don't work. :)

In setup.py, write out config files with an identity token and a boto3 config
file.  File-based config was essential; I couldn't get env vars working.

Here's a sample.  Be careful!  This can clobber your local boto3 config.  All
of this is in top-level scope of setup.py:


import os
import pathlib

import google.auth.transport.requests
import google.oauth2.id_token

# Get google id token
request = google.auth.transport.requests.Request()
id_token = google.oauth2.id_token.fetch_id_token(request, 'your-audience')
with open('/tmp/id_token', 'w') as f:
    f.write(id_token)

# create aws sdk config
home = os.getenv('HOME', '/tmp')
dotaws = pathlib.Path(home) / '.aws'
try:
    dotaws.mkdir()
except FileExistsError:
    pass

awsconfig = dotaws / 'config'
if awsconfig.exists():
    # back up an existing config rather than silently clobbering it
    cfgbackup = awsconfig.parent / 'config.bak'
    awsconfig.rename(cfgbackup)

with awsconfig.open('w') as f:
    f.write('[profile default]\n')
    f.write('role_arn = your-role-arn\n')
    f.write('web_identity_token_file = /tmp/id_token\n')


You need to substitute appropriate values for 'your-audience' and 'your-role-arn'.
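The snippet runs on each worker because Dataflow executes setup.py when the
job is launched with Beam's --setup_file option.  An invocation sketch - the
script name, project, region, and bucket are placeholders:

```shell
python pipeline.py \
  --runner DataflowRunner \
  --project your-gcp-project \
  --region us-central1 \
  --temp_location gs://your-bucket/tmp \
  --setup_file ./setup.py
```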

Ross


On Thu, 2020-10-01 at 15:47 +, Ross Vandegrift wrote:
> **This message came from an external sender.**
> 
> 
> Can you explain that a little bit?  Right now, our pipeline code is
> structured
> like this:
> 
>   if __name__ == '__main__':
>       setup_credentials()  # exports env vars for default boto session
>       run_pipeline()   # runs all the beam stuff
> 
> 
> So I expect every worker to set up their environment before running any beam
> code.  This seems to work fine.  Is there an issue lurking here?
> 
> Ross
> 
> On Wed, 2020-09-30 at 17:57 -0700, Pablo Estrada wrote:
> > You may need to set those up in setup.py so that the code runs in every
> > worker at startup.
> > 
> > On Wed, Sep 30, 2020, 10:16 AM Ross Vandegrift <
> > ross.vandegr...@cleardata.com> wrote:
> > > I see - it'd be great if the s3 io code would accept a boto session, so
> > > the
> > > default process could be overridden.
> > > 
> > > But it looks like the module lazy loads boto3 and uses the default
> > > session.  So I think it'll work if we set up SDK env vars before the
> > > pipeline
> > > code.
> > > 
> > > i.e., we'll try something like:
> > > 
> > > os.environ['AWS_ROLE_ARN'] = 'arn:aws:...'
> > > os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
> > > os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'
> > > 
> > > with beam.Pipeline(...) as p:
> > > ...
> > > 
> > > Ross
> > > 
> > > On Tue, 2020-09-29 at 14:29 -0700, Pablo Estrada wrote:
> > > > Hi Ross,
> > > > it seems that this feature is missing (e.g. passing a pipeline option
> > > > with authentication information for AWS). I'm sorry about that - that's
> > > > pretty annoying.
> > > > I wonder if you can use the setup.py file to add the default
> > > > configuration yourself until we add proper support for pipeline
> > > > option-based authentication. Could you try adding this default config
> > > > in setup.py?
> > > > Best
> > > > -P.
> > > > 
> > > > On Tue, Sep 29, 2020 at 11:16 AM Ross Vandegrift <
> > > > ross.vandegr...@cleardata.com> wrote:
> > > > > Hello all,
> > > > > 
> > > > > I have a python pipeline that writes data to an s3 bucket.  On my
> > > > > laptop it picks up the SDK credentials from my boto3 config and works
> > > > > great.
> > > > > 
> > > > > Is it possible to provide credentials explicitly?  I'd like to use
> > > > > remote dataflow runners, which won't have implicit AWS credentials
> > > > > available.
> > > > > 
> > > > > Thanks,
> > > > > Ross
> > > > > 
> > > > 
> > > > This message came from an external source. Please exercise caution
> > > > when opening attachments or clicking on links.


Re: Provide credentials for s3 writes

2020-10-01 Thread Ross Vandegrift
Can you explain that a little bit?  Right now, our pipeline code is structured
like this:

  if __name__ == '__main__':
      setup_credentials()  # exports env vars for default boto session
      run_pipeline()   # runs all the beam stuff


So I expect every worker to set up their environment before running any beam
code.  This seems to work fine.  Is there an issue lurking here?

Ross

On Wed, 2020-09-30 at 17:57 -0700, Pablo Estrada wrote:
> You may need to set those up in setup.py so that the code runs in every
> worker at startup.
> 
> On Wed, Sep 30, 2020, 10:16 AM Ross Vandegrift <
> ross.vandegr...@cleardata.com> wrote:
> > I see - it'd be great if the s3 io code would accept a boto session, so
> > the
> > default process could be overridden.
> > 
> > But it looks like the module lazy loads boto3 and uses the default
> > session.  So I think it'll work if we set up SDK env vars before the
> > pipeline
> > code.
> > 
> > i.e., we'll try something like:
> > 
> > os.environ['AWS_ROLE_ARN'] = 'arn:aws:...'
> > os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
> > os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'
> > 
> > with beam.Pipeline(...) as p:
> > ...
> > 
> > Ross
> > 
> > On Tue, 2020-09-29 at 14:29 -0700, Pablo Estrada wrote:
> > > Hi Ross,
> > > it seems that this feature is missing (e.g. passing a pipeline option
> > > with authentication information for AWS). I'm sorry about that - that's
> > > pretty annoying.
> > > I wonder if you can use the setup.py file to add the default
> > > configuration yourself until we add proper support for pipeline
> > > option-based authentication. Could you try adding this default config
> > > in setup.py?
> > > Best
> > > -P.
> > > 
> > > On Tue, Sep 29, 2020 at 11:16 AM Ross Vandegrift <
> > > ross.vandegr...@cleardata.com> wrote:
> > > > Hello all,
> > > > 
> > > > I have a python pipeline that writes data to an s3 bucket.  On my
> > > > laptop it picks up the SDK credentials from my boto3 config and works
> > > > great.
> > > > 
> > > > Is it possible to provide credentials explicitly?  I'd like to use
> > > > remote dataflow runners, which won't have implicit AWS credentials
> > > > available.
> > > > 
> > > > Thanks,
> > > > Ross
> > > > 


Re: Provide credentials for s3 writes

2020-09-30 Thread Ross Vandegrift
I see - it'd be great if the s3 io code would accept a boto session, so the
default process could be overridden.

But it looks like the module lazy loads boto3 and uses the default
session.  So I think it'll work if we set up SDK env vars before the pipeline
code.

i.e., we'll try something like:

os.environ['AWS_ROLE_ARN'] = 'arn:aws:...'
os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'

with beam.Pipeline(...) as p:
...

Ross

On Tue, 2020-09-29 at 14:29 -0700, Pablo Estrada wrote:
> Hi Ross,
> it seems that this feature is missing (e.g. passing a pipeline option with
> authentication information for AWS). I'm sorry about that - that's pretty
> annoying.
> I wonder if you can use the setup.py file to add the default configuration
> yourself until we add proper support for pipeline option-based
> authentication. Could you try adding this default config in setup.py?
> Best
> -P.
> 
> On Tue, Sep 29, 2020 at 11:16 AM Ross Vandegrift <
> ross.vandegr...@cleardata.com> wrote:
> > Hello all,
> > 
> > I have a python pipeline that writes data to an s3 bucket.  On my laptop
> > it picks up the SDK credentials from my boto3 config and works great.
> > 
> > Is it possible to provide credentials explicitly?  I'd like to use remote
> > dataflow runners, which won't have implicit AWS credentials available.
> > 
> > Thanks,
> > Ross
> > 