Let me start again from scratch and use KMS as an example. Maybe - again -
we understand things differently:

Just to start: KMSHook has two methods, "encrypt" and "decrypt". I would
continue to use that as a base.

Again, let me repeat: I do not want to implement a generic SecretHook!
I also do not want to implement a SecretOperator - I never did. I wanted
to implement VaultHook, GCPSecretManagerHook and AWSSecretManagerHook.

*Assumptions for the use case*

* Let's assume all secrets of Airflow (Connections and Variables) are kept
in HashiCorp Vault (using a SecretsBackend); Airflow is configured to read
them as Variables/Connections
* For security reasons, those secrets are read-only for Airflow. The Vault
is very secure - only high-security admins have access to it. Airflow can
only read from it.
* Additionally, the company uses KMS to keep encrypted data which is more
"temporary" in nature but should still be kept secret and never stored in a
traditional database. KMS keeps the history of those secrets and an audit
log, so that in case of any breach we can track its origin
* Access to the KMS service from within Airflow is both READ and WRITE.
* One of the Connections we have in the Airflow Connections (in the Vault)
is a set of GCP credentials used to both read from and write to KMS. It is
rotated frequently in the Vault, so that its unintended use in case it
leaks is limited to - say - 24 hours

*The Use Case:*

1) We need to generate a random SEED to start our calculations from. Every
job needs the same SEED as a parameter. However, we never want to store the
SEED in the Airflow database (so it cannot be passed via XCom). In the job
we have custom operators that do the following (note that the complexity of
handling authentication to KMS is handled - as usual - by the KMSHook;
KMSHook derives from GcpBaseHook and has all of the authentication handling
implemented):

hook = KMSHook(conn_id="service_account_for_accessing_kms_hook")
seed = rand()
hook.encrypt(key="seed<dag_id><run_id>", value=seed)

2) Then we run each of the jobs. Those jobs use custom operators that do:

hook = KMSHook(conn_id="service_account_for_accessing_kms_hook")
seed = hook.decrypt(key="seed<dag_id><run_id>")

In this case we treat KMS as a database of secrets that are temporary and
can be used across the jobs - but never stored in a "traditional" database.
They are stored encrypted, and the job has full control over the key names
used.
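Spelled out end to end, the two steps look like this. This is a runnable sketch only: a dict stands in for KMS so the snippet is self-contained, and the encrypt/decrypt signatures here are simplified assumptions, not the exact API of the real CloudKMSHook in airflow/providers/google/cloud/hooks/kms.py:

```python
import random


class FakeKMSHook:
    """Stand-in for KMSHook: a class-level dict plays the role of the
    KMS-held ciphertexts. The real hook would resolve credentials from
    conn_id (via GcpBaseHook) and call the KMS API; the method names
    below are illustrative."""

    _store = {}  # simulates data kept in KMS, never in the Airflow DB

    def __init__(self, conn_id):
        self.conn_id = conn_id  # real hook: resolves credentials from this

    def encrypt(self, key, value):
        # Real hook: sends plaintext to KMS, which stores it encrypted.
        self._store[key] = value

    def decrypt(self, key):
        # Real hook: asks KMS to decrypt the value held under this key.
        return self._store[key]


# Step 1: the "seed generator" operator's execute() body.
hook = FakeKMSHook(conn_id="service_account_for_accessing_kms_hook")
seed = random.randint(0, 2**32 - 1)
hook.encrypt(key="seed_<dag_id>_<run_id>", value=seed)

# Step 2: each downstream job's operator reads the same seed back.
hook = FakeKMSHook(conn_id="service_account_for_accessing_kms_hook")
same_seed = hook.decrypt("seed_<dag_id>_<run_id>")
assert same_seed == seed  # every job gets the identical seed, no XCom needed
```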

Surely we could use GCS or any other storage for that, but KMS gives us:

* audit logs
* encryption at rest
* history of seeds
* potential to destroy the secure data safely without the option of recovery

3) At the end we could even invalidate such a secret if we add a "delete"
method (which I had not thought about before, but I think it makes sense).

*Proposal*

I want the same capabilities we have now with KMSHook to be available in
the new hooks: VaultHook, GCPSecretManagerHook, AWSSecretManagerHook - so
that they can also be used as Hooks by Airflow to access (both read and
write) any kind of secrets there.
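Concretely, the only thing the new hooks would share is a similarly small read/write surface. A minimal sketch with an in-memory stand-in - the method names and the common shape are my assumptions, not an existing Airflow API:

```python
class InMemorySecretHook:
    """Stand-in for any of the proposed hooks (VaultHook,
    GCPSecretManagerHook, AWSSecretManagerHook). Each would be a separate,
    independent class in its own provider; only the small surface below
    is shared by convention, so custom operators can use any of them."""

    def __init__(self, conn_id):
        # Real hooks would resolve credentials from this Airflow connection.
        self.conn_id = conn_id
        self._secrets = {}

    def write_secret(self, key, value):
        # Assumed method name; each provider implements it against its API.
        self._secrets[key] = value

    def read_secret(self, key):
        # Assumed method name; returns the decrypted secret value.
        return self._secrets[key]


# A custom operator's execute() can then read and write secrets directly,
# without the secret ever touching XCom or the Airflow metadata DB:
vault = InMemorySecretHook(conn_id="vault_default")
vault.write_secret("api_token", "s3cr3t")
token = vault.read_secret("api_token")
```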

I really, really do not see why this is a problem of any sort. I wonder if
others see it as a problem? Ash, maybe you misunderstood the intention?

J.

On Tue, May 19, 2020 at 12:05 AM Ash Berlin-Taylor <a...@apache.org> wrote:

> This is why I was asking for a concrete example of where you'd want to
> use this, right now I still can't see what problem you are aiming to
> solve with this proposal.
>
> So I'll ask again Jarek: Do you have a concrete use case that is driving
> you wanting to create SecretManagerHook?
>
> > Straightforward API to be used by the operators.
>
> We already have that, don't we? It's the SecretsBackend API.
>
>
> Would your ask be solved by being able to configure multiple secrets
> backends rather than a just a single one?
>
>
> > hook = SecretManagerHook(conn_id = conn_id)
> > hook.decrypt(key="KEY")
>
> I don't think any of the  secrets backends support encrypting/decrypting
> values, did you mean `hook.get_secret` here?
>
> A counter proposal: Just use Variables within the operator. There is
> essentially no difference between "a secret" and "a variable", and
> doesn't introduce a whole new concept to Airflow.
>
>
> The advantage of Hooks is they are know how to ask for a connection ID
> to find their credentials.
>
> But for a secrets backend/hook that all gets very self-referential. So
> to use KMSHook does that mean I need to configure the secrets backend
> (to look up connections) _and_ create a connection for itself in KMS so
> the KMSHook and connect to it?
>
> That is my major complaint I think. It strikes me as a very messy API
> that is prone to user confusion and hard to document.
>
>
> I do have one ask though if we do go down this route: that we don't end
> up duplicating code to speak to the Secrets providers (i.e. in hooks and
> in secrets backends) - it should live in one place only. (I'm sure you'd
> do this anyway, I just wanted to state it)
>
> -ash
>
>
> On May 18 2020, at 10:42 pm, Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
>
> > On Mon, May 18, 2020 at 11:17 PM Ash Berlin-Taylor <a...@apache.org>
> wrote:
> >
> >> > GCP Secret Manager Hook, Vault Hook, KMS Hook, AWS Secret Hook
> >>
> >> Why do we even need Hooks for those? Why can't we use the existing GCP
> >> Secret Manager class inside a custom operator? What does creating a hook
> >> give us?
> >>
> >
> > The same as all the other hooks. Common way to authenticate to the
> service
> > (using Airflow Connection mechanism). Straightforward API to be used
> > by the
> > operators.
> >
> > Now that Kaxil mentioned it - This is exactly what KMS hooks gives - it
> > cause already defined
> > connection id in Airflow DB to authenticate to KMS and encrypt/decrypt
> > secret. Please take a look there:
> >
> >
> https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/hooks/kms.py
> >
> > Then in operator I'd use it like:
> >
> > hook = KMSHook(conn_id = conn_id)
> > hook.decrypt(key="KEY", cipher ="") (the cipher part is KMS-specific)..
> > or
> >
> > hook = SecretManagerHook(conn_id = conn_id)
> > hook.decrypt(key="KEY")
> >
> > Which is substantially easier than handling all the
> > authentication/credential options (for example in GCP case it handles all
> > the different forms of authentication 0 like variables, json file,
> > connection-encoded-credentials) out of the box. This is the very same
> > reason pretty much any hook exists in Airflow. The only difference for
> > secrets is that it makes no sense to write operators for them because of
> > having to pass decrypted secrets via XCom.
> >
> > From the very beginning of this conversation, I was surprised it is at
> all
> > a problem for anyone.
> >
> > I never intended to make any shared service, I just wanted to implement
> > what Kaxil described - separate hooks for all the secrets - same as any
> > other service.
> > I am quite surprised it is a problem for anyone (now knowing that KMS
> Hook
> > already exist at all makes it even more surprising).
> > J.
> >
> >
> >
> >>
> >> -a
> >>
> >> On May 18 2020, at 9:50 pm, Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
> >>
> >> >> Are we all talking about different things ?
> >> >
> >> > Good point. I think that's the main source of confusion here and we
> >> > think about different things.
> >> >
> >> >> So what I feel that the use case that Nathan defined can just be
> >> >> solved a
> >> >> VaultHook & VaultOperator for example.
> >> >
> >> > That's what I was talking (from the beginning - maybe it was not
> >> > clear) about separate hooks for each service. Not a shared one. GCP
> >> > Secret Manager Hook, Vault Hook,  KMS Hook, AWS Secret Hook - all of
> >> > them separate, in different providers, and simple hooks to be used by
> >> > whoever wants to use them in their custom operators.
> >> >
> >> > We also talked about implementing operators, but there is very little
> >> > use of generic Operators for secrets. Mainly because the only way
> >> > operators can pass anything to other operators (tasks) is via xcom
> >> > which would make the secrets stored plain text in the database. That
> >> > is rather bad I am afraid. Having Hooks make them instantiatable in
> >> > the context of running tasks, use Fernet to decrypt credentials from
> >> > the Connection DB, request to retrieve secret from the backend and
> >> > pass the unencrypted secret to the other parts of the operator - all
> >> > in the context of a single worker/task.
> >> >
> >> >>
> >> >> This should not be confused with "Secrets" at all. Why do we need to
> >> create
> >> >> a generic Hooks for all Secrets Backend?
> >> >
> >> > No generic hooks :). I never meant it to be generic.Maybe that's a
> >> > confusion there - I wanted to implement a separate hook for every type
> >> > of backend.
> >> >
> >> >> Consider we use PostgreSQL for backend and the connection is
> >> defined in
> >> >> airflow.cfg. Now you can still use the MySQLHook and PostgresHook
> >> >> independently to connect to those Databases, correct.
> >> >>
> >> >> But they both should not be confused to be using anything "shared".
> >> >
> >> > No plans for that whatsoever.
> >> >
> >> >> The proposal if I interpret correctly talks about the following:
> >> >>
> >> >> We have an idea that we might want also (on top of the above
> >> SecretManager
> >> >> > implementation) define generic Hooks for accessing secrets from
> those
> >> >> > services (just generic secrets, not connection, variables). Simply
> >> treat
> >> >> > each of the backends above as another "provider" and create a
> >> Hook to
> >> >> > access the service. Such Hook could have just one method:
> >> >> > def get_secret(self, path_prefix: str, secret_id: str) ->
> >> Optional[str]
> >> >> > It would use a connection defined (as usual) in ENV variables or
> >> database
> >> >> > of Airflow to authenticate with the secret service and retrieve the
> >> >> > secrets.
> >> >
> >> > OK. maybe confusion is about 'generic' . My "generic" was ("no
> >> > connections, no variables") - just retrieve "generic" secret. Separate
> >> > implementation for Hashicorp Vault, Separate for Secret Manager, etc.
> >> >
> >> >> The connection can be defined in The Secrets backend. To make it
> >> clearer,
> >> >> "Vault" in Nathan's case is a "Service" and has nothing to do with
> >> >> SecretsBackend similar to how PostgresHook or MySQLHook has nothing
> >> >> to do
> >> >> with using Postgres as Airflow MetadataDB Backend.
> >> >>
> >> >> Another example is Google KMS, there is already Hook for Google
> >> KMS (
> >> >>
> >>
> https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/hooks/kms.py
> >> )
> >> >> and an Operator can be created. Same can be done for Google Secrets
> >> Manager
> >> >> and Hashicorp Vault, in which cases all of these are "Services".
> >> >
> >> > That's exactly what I plan to implement. As explained above - Operator
> >> > for secrets makes no sense because it would have to pass the secrets
> >> > via xcom :(. I did not even check that we already have KMS hook. I was
> >> > mostly about Vault and Secret Manager and AWS Secret Manager. Knowing
> >> > that we have KMS makes it even easier :).
> >> >
> >> >> We could create SecretsHook similar to DbApiHook (
> >> >>
> >>
> https://github.com/apache/airflow/blob/master/airflow/hooks/dbapi_hook.py)
> >> >> if we want to just define the single *get_secret* method you talked
> >> about.
> >> >
> >> > I don't even plan that in fact, I thought about implementing several
> >> > totally independent Hooks for each of the Backend Secrets.
> >> >
> >> >> The concept of "Secrets Backend" is to allow Managing of "Secrets
> >> >> used in
> >> >> Airflow" (Either to connect to an external system or Variables) in
> >> actual
> >> >> Secret Management Tools.
> >> >>
> >> >
> >> > Yeah. I do not - at all - want to mess with that :)
> >> >
> >> >>
> >> >> *Pros:*
> >> >> >  And I
> >> >> > well imagine this might be actually even more convenient to
> configure
> >> >> > connections in the DB and access secrets this way rather than
> >> >> having to
> >> >> > configure Secret Backends in Airflow configuration.
> >> >>
> >> >> This is exactly where both "Secrets" and the "Service" terms are
> >> >> mixed I
> >> >> think. Again echoing what I said above : The concept of "Secrets
> >> Backend"
> >> >> is to allow Managing of "Secrets used in Airflow".
> >> >> The Secrets Backend is so that you don't need to store secrets in
> >> Airflow
> >> >> Metadata DB whether they can encrypted or not as there are tools that
> >> are
> >> >> specifically designed to handle "Secrets, rotation of secrets etc".
> >> Having
> >> >> the Hook and Operator to talk to the Service should be separate.
> >> >
> >> > Full agreement - I do not want to intermix those. It was always
> >> > thought as per-provider implementation of traditional "Hook".
> >> >
> >> >>
> >> >> * Another benefit of it is that it would allow people still stuck
> >> on pre
> >> >> > 1.10.10 to  write custom operators that would like to use secret
> >> backends
> >> >> > (via backport operators). And still continue doing it in the future
> >> >> > (possibly migrating to 2.0/1.10.10+ in cases when there is one
> secret
> >> >> > backed only - but continue ot use connections/hooks where some
> >> specific
> >> >> > secrets shoudl be kept in different secret backend.
> >> >>
> >> >>
> >> >> What is the objective here: (1) is it to interact with those Services
> >> >> (Vault or Secrets Manager etc) or (2) Get Airflow Connections and
> >> Variables
> >> >> from different Secrets Backend
> >> >
> >> > Just to interact with it - no plans at all to get Airflow Connections
> >> > nor Variables.
> >> >
> >> >>
> >> >> Regards,
> >> >> Kaxil
> >> >>
> >> >
> >>
> >
> >
> > --
> >
> > Jarek Potiuk
> >
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129
