Thanks Jarek and Kaxil for the feedback.

*Update on naming*
I have adopted the suggestion to name it "Secrets Backend" and updated the
AIP + PR and the subject of this thread.

*Discussion of proposed config simplification*
The status quo is that user could chose any number of backends to enable
simultaneously, and can chose any ordering of the search path.

I am inclined to adopt Kaxil's suggestion, namely that we simplify the
config by making it so the search path is not configurable.  With this
simplification, you can only specify one secrets backend (apart from the
default of env and DB).

Having to support multiple backends simultaneously means that you need to
provide a way to configure each backend simultaneously.  Which means you
may have to namespace your configs with a prefix so like this:

[secrets_backend]
class_list = ssm, gkms, envvar, metastore  # abbreviations merely out of
laziness
aws_ssm_secrets_backend_kwargs = ...
vault_secrets_backend_kwargs = ...
gkms_secrets_backend_kwargs = ...

Or find some way to combine them, like incorporating them with search path
definition:
secrets_backend_config = {"ssm": {"my": "param"}, "kms": {"my": "config"}
"envvar": null, "metastore": null}

But, by restricting to just one, you could do this with the config:
[secrets_backend]
class_name = path.to.my.CustomSecretsBackend
secrets_backend_kwargs = {"whatever": "information"}

It would still be easy for a user to write a secrets backend that searches
multiple sources.  And given this, I think the simplicity of the config
wins the argument.

Nevertheless I'll hold off on making the change for a bit and see if there
are any objections or other thoughts.



On Sun, Mar 8, 2020 at 1:33 PM Kaxil Naik <[email protected]> wrote:

> Thanks, Daniel, a *big "+1"* to the AIP and the idea of integrating
> external vaults and using them as connections. Was in the roadmap of items
> I wanted to work on so very happy that it is coming together and we are
> already discussing it.
>
> I have also looked at the PR and for the most part, it looks good to me.
> Following are my comments
>
>
>    1. "*SecretsBackend*" is a more intuitive name in my opinion (we can
>    have a vote in case we don't come to a conclusion :D)
>    2. I think we should have *creds_backend_kwargs* (either separate
>    section or a config allowing JSON-string) or something similar to avoid
>    hardcoding configs (like *aws_ssm_prefix*/*aws_ssm_profile_name*)
>    related to a specific provider in *airflow.cfg*.
>    If we don't do that we would need to hardcode these configs when I (or
>    someone) adds support for Google's KMS and Hashicorp Vault :)
>    3. How do we authenticate to the Secrets/Creds Backend?
>    For Google Cloud KMS or Hashicorp Vault, the main way of authenticating
>    I think is using Environment Variable.
>    Of course (2) can solve this, for example, we can pass
>    *GOOGLE_APPLICATION_CREDENTIAL* JSON file path to
> *creds_backend_kwargs *which
>    would then authenticate with GCP
>    Or do we want to get that value from* Connection object stored in DB*?
>
>
> Regards,
> Kaxil
>
>
> On Sun, Mar 8, 2020 at 6:41 AM Daniel Standish <[email protected]>
> wrote:
>
> > Thanks Jarek for adding the PR link.  In addition to forgetting to
> include
> > that, I put the wrong AIP number in subject -- fixed now (AIP-33 is the
> > correct AIP).
> >
> > I like the name Secrets -- I will let the other ideas percolate and will
> > look into your suggestion about using the plugin system.
> >
> >
> > On Sat, Mar 7, 2020 at 8:09 PM Jarek Potiuk <[email protected]>
> > wrote:
> >
> > > +1. I like it. And I like the general direction of the name as well not
> > to
> > > be tied with connections.
> > >
> > > I think it's also worth mentioning that there is PR in progress about
> it
> > > here: https://github.com/apache/airflow/pull/6376/files
> > >
> > > I have one observation though. We might not necessarily go in this
> > > direction or maybe defer it to later discussion, but I thought about
> > > extending the use of such CredentialsBackend and make it simply a
> > > SecretsBackend (that would be a nice solution to discussing various
> names
> > > for it :)).  Daniel - you might consider that as a "loose" comment.
> > >
> > > I do not want to hijack your topic here, so feel free to cut the
> > discussion
> > > here if you think it's going too far from the original proposal.
> > >
> > > Why don't we change it to a bit more generic "SecretsBackend" and
> > provide a
> > > more generic interface to use by the Hooks/Operators? I think the
> current
> > > implementation is not very far from it. Additionally to just providing
> a
> > > "get_connection" method it could provide much more generic "get_secret"
> > and
> > > this would give the Hooks/Operators an opportunity to use it to
> retrieve
> > > all various kinds of secrets, not only Connections. We already have a
> > > half-baked solution for that with the _COMMAND variables and maybe we
> can
> > > incorporate this in backwards-compatible way in the "SecretBackend'
> > > similarly as you did with Metastore and Env Backends.
> > >
> > > I think this would be a very handy feature - some of the secrets are
> not
> > > necessary in the URI form, they might be much more complex and contain
> > > better structured data and might be needed or available at different
> > times
> > > than when connection is retrieved. Often secrets are short-living and
> > they
> > > should be retrieved "just" before they are used - limiting it only to
> > when
> > > the Hook is created and only to URI is quite a limitation I think.
> > >
> > > Again - feel free to disregard that comment, I just thought it might
> be a
> > > handy feature and maybe it's a good time to add it.
> > >
> > > J.
> > >
> > >
> > > On Sun, Mar 8, 2020 at 1:03 AM Daniel Standish <[email protected]>
> > > wrote:
> > >
> > > > We can currently retrieve connections from environment variables or
> the
> > > > metastore database.
> > > >
> > > > AIP-33
> > > > <
> > >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-33+Creds+backend
> > >
> > > > provides
> > > > a way to retrieve them from other sources, for example AWS SSM
> > parameter
> > > > store.
> > > >
> > > > There are many instances in airflow where we allow for user
> > customization
> > > > like this, for example with auth backend and hostname callable -- it
> > just
> > > > hasn't been done yet with creds.
> > > >
> > > > *How is it implemented?*
> > > > This adds a base class BaseCredsBackend which takes over BaseHook's
> > > > get_connections method.  Then EnvironmentVariablesCredsBackend and
> > > > MetastoreCredsBackend are added as subclasses.  New implementations
> can
> > > be
> > > > added and user can configure precedence.  E.g. instead of the default
> > env
> > > > vars > metastore, a user could specify SSM > metastore > env vars in
> > > > airflow.cfg.
> > > >
> > > > *Does this break anything?*
> > > > No. This is a relatively simple refactor that results in no change in
> > the
> > > > default behavior unless user modifies config to enable an alternative
> > > > backend.
> > > >
> > > > *Why is this worth adding?*
> > > > If implemented, you could store creds anywhere, as long as you can
> > write
> > > a
> > > > `get_connections` method that retrieves them.
> > > > This can make it possible for a team of devs to share one creds
> source
> > > > rather than each dev storing them in text files, for example.
> > > > Can also make it easier to spin up a dev instance since you don't
> have
> > to
> > > > worry about loading creds.
> > > > And if you don't have access to the airflow CLI (because you are on a
> > > cloud
> > > > platform perhaps) then this can provide an easy way to load and
> manage
> > > > creds.
> > > >
> > > > *Why do you call it creds?*
> > > > I don't need to call it creds.  I could call it BaseConnectionBackend
> > or
> > > > BaseConnectionInfoBackend or any other thing -- I don't care too
> much.
> > > > There was something about calling it connection backend that was
> > > > unsatisfying.  Perhaps because connection is a bit of an ambiguous
> > word.
> > > > So even though Connection is the name of the model that we
> instantiate
> > > when
> > > > retrieving these creds, creds seemed more specific and more
> > > > representative.  But again if there is consensus on something else, I
> > am
> > > > happy to change.
> > > > This adds a `creds` package under `airflow`.  Alternatively would we
> > > create
> > > > a `connections` package?  Would this cause confusion relative to the
> > > > *model*
> > > > Connection?
> > > > Another thought: even though today this backend will produce a
> > connection
> > > > object, maybe in the future it can produce a different model --
> either
> > > way,
> > > > this backend is still about retrieving *creds*.
> > > >
> > > > *Outstanding questions*
> > > > As mentioned above, some don't like using the word Creds because this
> > is
> > > > about producing connections.  I totally get this.  If there is
> general
> > > > consensus around another name, I am happy to change it.
> > > > I welcome any other suggestions or feedback on the structure of this
> > > > implementation.
> > > >
> > >
> > >
> > > --
> > >
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48660796129>
> > > [image: Polidea] <https://www.polidea.com/>
> > >
> >
>

Reply via email to