Cool. I thought it was a misunderstanding :). Great that it is clear now!

J.

On Tue, May 19, 2020 at 11:17 AM Ash Berlin-Taylor <a...@apache.org> wrote:

> Yes, sorry I got completely the wrong idea somehow. This makes sense,
> and thank you for patiently explaining it to me until I got it!
>
> My main reason for questioning was not this specific feature, but the
> gradual "scope creep" of Airflow operators.
>
> One of the hardest things we as project "stewards" have to do is say no
> to features. There have been a few examples of merged PRs that I've seen
> recently where my immediate reaction was "just use Terraform".
>
> My main worry is the "when all you have is a hammer, everything looks
> like a nail" syndrome, where we end up re-inventing
> Ansible/Terraform/CloudFormation (to give an example), but badly, and in
> operators.
>
> -ash
>
> On May 19 2020, at 7:26 am, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
>
> > Let me start again from scratch and use KMS as an example. Maybe - again -
> > we understand things differently.
> >
> > Just to start: KMSHook has two methods, "encrypt" and "decrypt". I would
> > continue to use that as a base.
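> >
> > For reference, this is roughly how that existing hook is used today (a
> > sketch from memory - check the kms.py hook in the repo for the exact class
> > and parameter names, they may differ slightly):
> >
> > from airflow.providers.google.cloud.hooks.kms import CloudKMSHook
> >
> > hook = CloudKMSHook(gcp_conn_id="google_cloud_default")
> > key_name = "projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key"
> > # encrypt() returns base64-encoded ciphertext, decrypt() returns the original bytes
> > ciphertext = hook.encrypt(key_name=key_name, plaintext=b"my-secret-value")
> > plaintext = hook.decrypt(key_name=key_name, ciphertext=ciphertext)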
> >
> > Again, let me repeat that: I do not want to implement a generic SecretHook!
> > I also do not want to implement a SecretOperator. I never wanted to. I
> > wanted to implement VaultHook, GCPSecretManagerHook and AWSSecretManagerHook.
> >
> > *Assumptions for the use case*
> >
> > * Let's assume all secrets of Airflow (Connections and Variables) are kept
> > in HashiCorp Vault (using the SecretsBackend), and Airflow is configured to
> > read them as Variables/Connections.
> > * For security reasons those secrets are read-only for Airflow. The Vault
> > is very secure - only high-security admins have access there. Airflow can
> > only read from it.
> > * Additionally the company uses KMS to keep encrypted data which is more
> > "temporary" in nature but should still be kept secret and never stored in a
> > traditional database. It keeps the history of those secrets and an audit
> > log, so that in case of any breach we can track its origin.
> > * Access to the KMS service from within Airflow is both READ and WRITE.
> > * One of the Connections we have in the Airflow Connections (in the Vault)
> > holds the GCP credentials to both read and write from KMS. It is rotated
> > frequently in the Vault, so that its unintended use in case it leaks is
> > limited to - say - 24 hours.
> >
> > *The Use Case:*
> >
> > 1) We need to generate a random SEED to start our calculations from. The
> > same SEED is needed by every job as a parameter. However, we never want to
> > store the SEED in the Airflow database (so it cannot be passed via XCom).
> > In the job we have custom operators that do this (note that the complexity
> > of handling authentication to KMS is handled - as usual - by the KMSHook.
> > KMSHook derives from GcpBaseHook and has all the complexity of handling the
> > authentication implemented):
> >
> > hook = KMSHook(conn_id="service_account_for_accessing_kms_hook")
> > seed = rand()
> > hook.encrypt(key="seed<dag_id><run_id>", plaintext=seed)
> >
> > 2) Then we run each of the jobs. Those jobs use custom operators that do:
> >
> > hook = KMSHook(conn_id="service_account_for_accessing_kms_hook")
> > seed = hook.decrypt(key="seed<dag_id><run_id>")
> >
> > In this case we treat KMS as a database of secrets that are temporary and
> > can be used across the jobs - but never stored in a "traditional" database.
> > They are stored encrypted, and the job has full control over the key names
> > used.
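> >
> > To sketch how the two steps could look inside the custom operators
> > (hypothetical operator names, and the same simplified key/value-style
> > encrypt/decrypt calls as in the snippets above - not the exact CloudKMSHook
> > signatures):
> >
> > import random
> >
> > from airflow.models.baseoperator import BaseOperator
> >
> > class GenerateSeedOperator(BaseOperator):  # hypothetical custom operator
> >     def execute(self, context):
> >         # KMSHook as used above - simplified key/value-style API
> >         hook = KMSHook(conn_id="service_account_for_accessing_kms_hook")
> >         seed = str(random.SystemRandom().random())
> >         # key derived from dag_id/run_id, so downstream tasks can rebuild it
> >         hook.encrypt(key=f"seed{self.dag_id}{context['run_id']}", plaintext=seed)
> >         # nothing is returned, so the seed never lands in XCom / the metadata DB
> >
> > class ComputeWithSeedOperator(BaseOperator):  # hypothetical custom operator
> >     def execute(self, context):
> >         hook = KMSHook(conn_id="service_account_for_accessing_kms_hook")
> >         seed = float(hook.decrypt(key=f"seed{self.dag_id}{context['run_id']}"))
> >         ...  # run the actual computation with the shared seed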
> >
> > Surely we could use GCS or any other storage for that, but KMS gives us:
> >
> > * audit logs
> > * encryption at rest
> > * history of seeds
> > * potential to destroy the secure data safely, without the option of
> > recovery
> >
> > 3) At the end, we could even invalidate such a secret if we add a "delete"
> > method (which I have not thought about yet, but I think it makes sense).
> >
> > *Proposal*
> >
> > I want the same capabilities we have now with KMSHook to be available in
> > new hooks: VaultHook, GCPSecretManagerHook, AWSSecretManagerHook - so that
> > they can also be used as Hooks by Airflow to access (both read and write)
> > any kind of secrets there.
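> >
> > To make that concrete, a minimal sketch of what a VaultHook could look like
> > (the class and method names are just my proposal here; it assumes the hvac
> > client, token auth and a KV v2 secrets engine):
> >
> > import hvac
> >
> > from airflow.hooks.base_hook import BaseHook
> >
> > class VaultHook(BaseHook):  # sketch only - no such hook exists yet
> >     def __init__(self, vault_conn_id="vault_default"):
> >         conn = self.get_connection(vault_conn_id)
> >         # token auth only in this sketch; a real hook would support the
> >         # other Vault auth methods as well
> >         self.client = hvac.Client(url=conn.host, token=conn.password)
> >
> >     def get_secret(self, path: str) -> dict:
> >         response = self.client.secrets.kv.v2.read_secret_version(path=path)
> >         return response["data"]["data"]
> >
> >     def set_secret(self, path: str, secret: dict) -> None:
> >         self.client.secrets.kv.v2.create_or_update_secret(path=path, secret=secret)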
> >
> > I really, really do not see why this is a problem of any sort. I wonder if
> > others see it as a problem? Ash, maybe you misunderstood the intention?
> >
> > J.
> >
> > On Tue, May 19, 2020 at 12:05 AM Ash Berlin-Taylor <a...@apache.org>
> wrote:
> >
> >> This is why I was asking for a concrete example of where you'd want to
> >> use this; right now I still can't see what problem you are aiming to
> >> solve with this proposal.
> >>
> >> So I'll ask again, Jarek: do you have a concrete use case that is driving
> >> you to want to create a SecretManagerHook?
> >>
> >> > Straightforward API to be used by the operators.
> >>
> >> We already have that, don't we? It's the SecretsBackend API.
> >>
> >>
> >> Would your ask be solved by being able to configure multiple secrets
> >> backends rather than just a single one?
> >>
> >>
> >> > hook = SecretManagerHook(conn_id = conn_id)
> >> > hook.decrypt(key="KEY")
> >>
> >> I don't think any of the secrets backends support encrypting/decrypting
> >> values - did you mean `hook.get_secret` here?
> >>
> >> A counter-proposal: just use Variables within the operator. There is
> >> essentially no difference between "a secret" and "a variable", and this
> >> doesn't introduce a whole new concept to Airflow.
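> >>
> >> Roughly (assuming a secrets backend is configured, Variable.get resolves
> >> from it first, before environment variables and the metadata DB):
> >>
> >> from airflow.models import Variable
> >>
> >> # the value comes from Vault / Secret Manager / etc. via the secrets backend
> >> seed = Variable.get("seed")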
> >>
> >>
> >> The advantage of Hooks is that they know how to ask for a connection ID
> >> to find their credentials.
> >>
> >> But for a secrets backend/hook that all gets very self-referential. So to
> >> use KMSHook, does that mean I need to configure the secrets backend (to
> >> look up connections) _and_ create a connection for KMS itself so the
> >> KMSHook can connect to it?
> >>
> >> That is my major complaint I think. It strikes me as a very messy API
> >> that is prone to user confusion and hard to document.
> >>
> >>
> >> I do have one ask though if we do go down this route: that we don't end
> >> up duplicating code to speak to the Secrets providers (i.e. in hooks and
> >> in secrets backends) - it should live in one place only. (I'm sure you'd
> >> do this anyway, I just wanted to state it)
> >>
> >> -ash
> >>
> >>
> >> On May 18 2020, at 10:42 pm, Jarek Potiuk <jarek.pot...@polidea.com>
> >> wrote:
> >>
> >> > On Mon, May 18, 2020 at 11:17 PM Ash Berlin-Taylor <a...@apache.org>
> >> wrote:
> >> >
> >> >> > GCP Secret Manager Hook, Vault Hook, KMS Hook, AWS Secret Hook
> >> >>
> >> >> Why do we even need Hooks for those? Why can't we use the existing GCP
> >> >> Secret Manager class inside a custom operator? What does creating a
> >> >> hook give us?
> >> >>
> >> >
> >> > The same as all the other hooks: a common way to authenticate to the
> >> > service (using the Airflow Connection mechanism) and a straightforward
> >> > API to be used by the operators.
> >> >
> >> > Now that Kaxil mentioned it - this is exactly what the KMS hook gives:
> >> > it uses an already-defined connection id in the Airflow DB to
> >> > authenticate to KMS and encrypt/decrypt a secret. Please take a look here:
> >> >
> >> > https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/hooks/kms.py
> >> >
> >> > Then in an operator I'd use it like:
> >> >
> >> > hook = KMSHook(conn_id=conn_id)
> >> > hook.decrypt(key="KEY", cipher="")  # the cipher part is KMS-specific
> >> >
> >> > or
> >> >
> >> > hook = SecretManagerHook(conn_id=conn_id)
> >> > hook.decrypt(key="KEY")
> >> >
> >> > This is substantially easier than handling all the
> >> > authentication/credential options yourself (for example, in the GCP case
> >> > the hook handles all the different forms of authentication - variables,
> >> > json file, connection-encoded credentials - out of the box). This is the
> >> > very same reason pretty much any hook exists in Airflow. The only
> >> > difference for secrets is that it makes no sense to write operators for
> >> > them, because that would mean passing decrypted secrets via XCom.
> >> >
> >> > From the very beginning of this conversation, I was surprised this is a
> >> > problem for anyone at all.
> >> >
> >> > I never intended to make any shared service; I just wanted to implement
> >> > what Kaxil described - separate hooks for all the secrets services, same
> >> > as for any other service. Knowing now that a KMS Hook already exists
> >> > makes the objection even more surprising.
> >> >
> >> > J.
> >> >
> >> >
> >> >
> >> >>
> >> >> -a
> >> >>
> >> >> On May 18 2020, at 9:50 pm, Jarek Potiuk <jarek.pot...@polidea.com>
> >> wrote:
> >> >>
> >> >> >> Are we all talking about different things ?
> >> >> >
> >> >> > Good point. I think that's the main source of confusion here and we
> >> >> > think about different things.
> >> >> >
> >> >> >> So what I feel is that the use case that Nathan defined can just be
> >> >> >> solved with a VaultHook & VaultOperator, for example.
> >> >> >
> >> >> > That's what I was talking about (from the beginning - maybe it was
> >> >> > not clear): separate hooks for each service, not a shared one. GCP
> >> >> > Secret Manager Hook, Vault Hook, KMS Hook, AWS Secret Hook - all of
> >> >> > them separate, in different providers, and simple hooks to be used by
> >> >> > whoever wants to use them in their custom operators.
> >> >> >
> >> >> > We also talked about implementing operators, but there is very little
> >> >> > use for generic Operators for secrets, mainly because the only way
> >> >> > operators can pass anything to other operators (tasks) is via XCom,
> >> >> > which would store the secrets in plain text in the database. That is
> >> >> > rather bad, I am afraid. Having Hooks makes it possible to instantiate
> >> >> > them in the context of a running task, use Fernet to decrypt
> >> >> > credentials from the Connection DB, retrieve the secret from the
> >> >> > backend, and pass the unencrypted secret to the other parts of the
> >> >> > operator - all in the context of a single worker/task.
> >> >> >
> >> >> >>
> >> >> >> This should not be confused with "Secrets" at all. Why do we need to
> >> >> >> create generic Hooks for all Secrets Backends?
> >> >> >
> >> >> > No generic hooks :). I never meant it to be generic. Maybe that's the
> >> >> > confusion there - I wanted to implement a separate hook for every type
> >> >> > of backend.
> >> >> >
> >> >> >> Consider we use PostgreSQL for the metadata backend and the
> >> >> >> connection is defined in airflow.cfg. You can still use the MySQLHook
> >> >> >> and PostgresHook independently to connect to those databases, correct?
> >> >> >>
> >> >> >> But neither of them should be confused with using anything "shared".
> >> >> >
> >> >> > No plans for that whatsoever.
> >> >> >
> >> >> >> The proposal, if I interpret it correctly, talks about the following:
> >> >> >>
> >> >> >> > We have an idea that we might also want (on top of the above
> >> >> >> > SecretManager implementation) to define generic Hooks for accessing
> >> >> >> > secrets from those services (just generic secrets, not connections
> >> >> >> > or variables). Simply treat each of the backends above as another
> >> >> >> > "provider" and create a Hook to access the service. Such a Hook
> >> >> >> > could have just one method:
> >> >> >> > def get_secret(self, path_prefix: str, secret_id: str) -> Optional[str]
> >> >> >> > It would use a connection defined (as usual) in ENV variables or
> >> >> >> > the Airflow database to authenticate with the secret service and
> >> >> >> > retrieve the secrets.
> >> >> >
> >> >> > OK, maybe the confusion is about 'generic'. My "generic" meant "no
> >> >> > connections, no variables" - just retrieve a "generic" secret. A
> >> >> > separate implementation for Hashicorp Vault, a separate one for Secret
> >> >> > Manager, etc.
> >> >> >
> >> >> >> The connection can be defined in the Secrets backend. To make it
> >> >> >> clearer, "Vault" in Nathan's case is a "Service" and has nothing to
> >> >> >> do with the SecretsBackend, similar to how PostgresHook or MySQLHook
> >> >> >> has nothing to do with using Postgres as the Airflow Metadata DB
> >> >> >> backend.
> >> >> >>
> >> >> >> Another example is Google KMS; there is already a Hook for Google KMS (
> >> >> >> https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/hooks/kms.py )
> >> >> >> and an Operator can be created. The same can be done for Google
> >> >> >> Secrets Manager and Hashicorp Vault, in which cases all of these are
> >> >> >> "Services".
> >> >> >
> >> >> > That's exactly what I plan to implement. As explained above, an
> >> >> > Operator for secrets makes no sense because it would have to pass the
> >> >> > secrets via XCom :(. I had not even checked that we already have a KMS
> >> >> > hook - I was mostly thinking about Vault, Secret Manager and AWS
> >> >> > Secret Manager. Knowing that we have KMS makes it even easier :).
> >> >> >
> >> >> >> We could create a SecretsHook similar to DbApiHook (
> >> >> >> https://github.com/apache/airflow/blob/master/airflow/hooks/dbapi_hook.py )
> >> >> >> if we want to just define the single *get_secret* method you talked
> >> >> >> about.
> >> >> >
> >> >> > I don't even plan that, in fact; I thought about implementing several
> >> >> > totally independent Hooks, one for each of the secrets backends.
> >> >> >
> >> >> >> The concept of a "Secrets Backend" is to allow managing "Secrets used
> >> >> >> in Airflow" (either Connections to external systems, or Variables) in
> >> >> >> actual secret management tools.
> >> >> >>
> >> >> >
> >> >> > Yeah. I do not - at all - want to mess with that :)
> >> >> >
> >> >> >>
> >> >> >> *Pros:*
> >> >> >> > And I can well imagine this might actually be even more convenient:
> >> >> >> > to configure connections in the DB and access secrets this way,
> >> >> >> > rather than having to configure Secrets Backends in the Airflow
> >> >> >> > configuration.
> >> >> >>
> >> >> >> This is exactly where both the "Secrets" and the "Service" terms get
> >> >> >> mixed, I think. Again, echoing what I said above: the concept of a
> >> >> >> "Secrets Backend" is to allow managing "Secrets used in Airflow".
> >> >> >> The Secrets Backend exists so that you don't need to store secrets in
> >> >> >> the Airflow Metadata DB, whether encrypted or not, as there are tools
> >> >> >> specifically designed to handle secrets, rotation of secrets, etc.
> >> >> >> Having a Hook and an Operator to talk to the Service should be a
> >> >> >> separate thing.
> >> >> >
> >> >> > Full agreement - I do not want to intermix those. It was always
> >> >> > intended as a per-provider implementation of a traditional "Hook".
> >> >> >
> >> >> >>
> >> >> >> > * Another benefit is that it would allow people still stuck on
> >> >> >> > pre-1.10.10 to write custom operators that would like to use secrets
> >> >> >> > backends (via backport operators), and to continue doing so in the
> >> >> >> > future (possibly migrating to 2.0/1.10.10+ in cases when there is
> >> >> >> > only one secrets backend - but continuing to use connections/hooks
> >> >> >> > where some specific secrets should be kept in a different secrets
> >> >> >> > backend).
> >> >> >>
> >> >> >>
> >> >> >> What is the objective here: (1) to interact with those Services
> >> >> >> (Vault or Secrets Manager etc.), or (2) to get Airflow Connections
> >> >> >> and Variables from different Secrets Backends?
> >> >> >
> >> >> > Just to interact with them - no plans at all to get Airflow
> >> >> > Connections or Variables this way.
> >> >> >
> >> >> >>
> >> >> >> Regards,
> >> >> >> Kaxil
> >> >> >>
> >> >> >
> >> >>
> >> >
> >> >
> >> > --
> >> >
> >> > Jarek Potiuk
> >> >
> >>
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> >
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
