Cool. I thought it's a misunderstanding :). Great it is clear now! J.
On Tue, May 19, 2020 at 11:17 AM Ash Berlin-Taylor <a...@apache.org> wrote: > Yes, sorry I got completely the wrong idea somehow. This makes sense, > and thank you for patiently explaining it to me until I got it! > > My main reason for questioning was not this specific feature, but the > gradual "scope creep" of Airflow operators. > > One of the hardest things we as project "stewards" have to do is say no > to features. There have been a few examples of merged PRs that I've seen > recently where my immediate reaction was "just use Terraform". > > My main worry is that "when all you have is a hammer, everything looks > like a nail" syndrome, where we end up re-inventing > Ansible/Terraform/CloudFormation (to give an example), but badly, and in > operators. > > -ash > > On May 19 2020, at 7:26 am, Jarek Potiuk <jarek.pot...@polidea.com> wrote: > > > Let me start again from scratch and use KMS as an example. Maybe - > > again - > > we understand things differently: > > > > Just to start KMSHook -> has two methods: "encrypt" and "decrypt". I > would > > continue to use that as a base. > > > > Again let me repeat that. I do not want to implement a generic > SecretHook! > > I also do not want to implement SecretOperator. I never wanted to. I > wanted > > to implement VaultHook, GCPSeceretManagerHook, AWSSecretManagerHook. > > > > *Assumptions for the use case* > > > > * Let's assume all secrets of airflow (Connections and Variables) are > kept > > in HashiCorpVault (using SecretsBackend) Airflow is configured to read > them > > as Variables/Connections > > * For security reasons those secrets are read-only for Airflow. The Vault > > is very secure - only high security admins have access there. Airflow can > > only read from it. > > * Additionally the company uses KMS to keep encrypted data which is more > > "temporary" in nature but still should be kept secret and never stored > > in a > > traditional database. It keeps the history of those secrets and audit log > > so that in case of any breach we can track the origin of the breach > > * Aaccess to the KMS service from within Airflow is both READ and WRITE > > * One of the Connections we have in the Airflow Connections (in the > Vault) > > are GCP credentials to both read/write from KMS. It is rotated frequently > > in the Vault so that it's unintended use in case it leaks is limited > > to - > > say - 24 hours > > > > *The Use Case:* > > > > 1) We need to generate a random SEEED to start our calculations from. We > > need the same SEED by every job as parameter. However we never want to > > store the SEED in the Airflow database (so they cannot be passed as > XCOM). > > In the job we have Custom Operators that do this: (note that the > complexity > > of handling authentication to KMS is handled - as usual by the KMSHook. > > KMSHook derives from GcpBaseHook and has all the complexity of > > handling the > > authentication implemented): > > > > hook = KMSHook(conn_id="service_account_for_accessing_kms_hook") > > seed = rand() > > hook.encrypt(key="seed<dag_id><run_id>") > > > > 2) Then we run each of the jobs. Those jobs use custom operators that do: > > > > hook = KMSHook(conn_id="service_account_for_accessing_kms_hook") > > seed = hook.decrypt("seed<dag_id><run_id>) > > > > In this case we treat KMS as a database of secrets that are temporary and > > can be used across the jobs - but never stored in a "traditional" > database. > > they are stored encrypted and the job has full control over the key names > > used. > > > > Surely we could use GCS or any other storage for that, but KMS gives us: > > > > * audit logs > > * encryption at rest > > * history of seeds > > * potential to destroy the secure data safely without the option of > recovery > > > > 3) A the end we could even invalidate such secret if we add "delete" > method > > (which I have not thought about but I think it makes sense).. > > > > *Proposal* > > > > I want the same capabilities as we have now with KMSHook to be > > available in > > new hooks: VaultHook, GCPSeceretManagerHook, AWSSecretManagerHook. So > that > > they can also be used as Hooks by Airflow to access (both read and write) > > any kind of secrets there. > > > > I really, really do not see why this is a problem of any sort. I > > wonder if > > others see it as a problem? Ash, maybe you misunderstood the > > intention ? > > > > J. > > > > On Tue, May 19, 2020 at 12:05 AM Ash Berlin-Taylor <a...@apache.org> > wrote: > > > >> This is why I was asking for a concrete example of where you'd want to > >> use this, right now I still can't see what problem you are aiming to > >> solve with this proposal. > >> > >> So I'll ask again Jarek: Do you have a concrete use case that is driving > >> you wanting to create SecretManagerHook? > >> > >> > Straightforward API to be used by the operators. > >> > >> We already have that, don't we? It's the SecretsBackend API. > >> > >> > >> Would your ask be solved by being able to configure multiple secrets > >> backends rather than a just a single one? > >> > >> > >> > hook = SecretManagerHook(conn_id = conn_id) > >> > hook.decrypt(key="KEY") > >> > >> I don't think any of the secrets backends support encrypting/decrypting > >> values, did you mean `hook.get_secret` here? > >> > >> A counter proposal: Just use Variables within the operator. There is > >> essentially no difference between "a secret" and "a variable", and > >> doesn't introduce a whole new concept to Airflow. > >> > >> > >> The advantage of Hooks is they are know how to ask for a connection ID > >> to find their credentials. > >> > >> But for a secrets backend/hook that all gets very self-referential. So > >> to use KMSHook does that mean I need to configure the secrets backend > >> (to look up connections) _and_ create a connection for itself in KMS so > >> the KMSHook and connect to it? > >> > >> That is my major complaint I think. It strikes me as a very messy API > >> that is prone to user confusion and hard to document. > >> > >> > >> I do have one ask though if we do go down this route: that we don't end > >> up duplicating code to speak to the Secrets providers (i.e. in hooks and > >> in secrets backends) - it should live in one place only. (I'm sure you'd > >> do this anyway, I just wanted to state it) > >> > >> -ash > >> > >> > >> On May 18 2020, at 10:42 pm, Jarek Potiuk <jarek.pot...@polidea.com> > >> wrote: > >> > >> > On Mon, May 18, 2020 at 11:17 PM Ash Berlin-Taylor <a...@apache.org> > >> wrote: > >> > > >> >> > GCP Secret Manager Hook, Vault Hook, KMS Hook, AWS Secret Hook > >> >> > >> >> Why do we even need Hooks for those? Why can't we use the existing > GCP > >> >> Secret Manager class inside a custom operator? What does creating > >> a hook > >> >> give us? > >> >> > >> > > >> > The same as all the other hooks. Common way to authenticate to the > >> service > >> > (using Airflow Connection mechanism). Straightforward API to be used > >> > by the > >> > operators. > >> > > >> > Now that Kaxil mentioned it - This is exactly what KMS hooks gives > >> - it > >> > cause already defined > >> > connection id in Airflow DB to authenticate to KMS and encrypt/decrypt > >> > secret. Please take a look there: > >> > > >> > > >> > https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/hooks/kms.py > >> > > >> > Then in operator I'd use it like: > >> > > >> > hook = KMSHook(conn_id = conn_id) > >> > hook.decrypt(key="KEY", cipher ="") (the cipher part is > KMS-specific).. > >> > or > >> > > >> > hook = SecretManagerHook(conn_id = conn_id) > >> > hook.decrypt(key="KEY") > >> > > >> > Which is substantially easier than handling all the > >> > authentication/credential options (for example in GCP case it > >> handles all > >> > the different forms of authentication 0 like variables, json file, > >> > connection-encoded-credentials) out of the box. This is the very same > >> > reason pretty much any hook exists in Airflow. The only difference for > >> > secrets is that it makes no sense to write operators for them > >> because of > >> > having to pass decrypted secrets via XCom. > >> > > >> > From the very beginning of this conversation, I was surprised it is at > >> all > >> > a problem for anyone. > >> > > >> > I never intended to make any shared service, I just wanted to > implement > >> > what Kaxil described - separate hooks for all the secrets - same as > any > >> > other service. > >> > I am quite surprised it is a problem for anyone (now knowing that KMS > >> Hook > >> > already exist at all makes it even more surprising). > >> > J. > >> > > >> > > >> > > >> >> > >> >> -a > >> >> > >> >> On May 18 2020, at 9:50 pm, Jarek Potiuk <jarek.pot...@polidea.com> > >> wrote: > >> >> > >> >> >> Are we all talking about different things ? > >> >> > > >> >> > Good point. I think that's the main source of confusion here and we > >> >> > think about different things. > >> >> > > >> >> >> So what I feel that the use case that Nathan defined can just be > >> >> >> solved a > >> >> >> VaultHook & VaultOperator for example. > >> >> > > >> >> > That's what I was talking (from the beginning - maybe it was not > >> >> > clear) about separate hooks for each service. Not a shared one. GCP > >> >> > Secret Manager Hook, Vault Hook, KMS Hook, AWS Secret Hook - > >> all of > >> >> > them separate, in different providers, and simple hooks to be > >> used by > >> >> > whoever wants to use them in their custom operators. > >> >> > > >> >> > We also talked about implementing operators, but there is very > little > >> >> > use of generic Operators for secrets. Mainly because the only way > >> >> > operators can pass anything to other operators (tasks) is via xcom > >> >> > which would make the secrets stored plain text in the database. > That > >> >> > is rather bad I am afraid. Having Hooks make them instantiatable in > >> >> > the context of running tasks, use Fernet to decrypt credentials > from > >> >> > the Connection DB, request to retrieve secret from the backend and > >> >> > pass the unencrypted secret to the other parts of the operator - > all > >> >> > in the context of a single worker/task. > >> >> > > >> >> >> > >> >> >> This should not be confused with "Secrets" at all. Why do we > >> need to > >> >> create > >> >> >> a generic Hooks for all Secrets Backend? > >> >> > > >> >> > No generic hooks :). I never meant it to be generic.Maybe that's a > >> >> > confusion there - I wanted to implement a separate hook for > >> every type > >> >> > of backend. > >> >> > > >> >> >> Consider we use PostgreSQL for backend and the connection is > >> >> defined in > >> >> >> airflow.cfg. Now you can still use the MySQLHook and PostgresHook > >> >> >> independently to connect to those Databases, correct. > >> >> >> > >> >> >> But they both should not be confused to be using anything > "shared". > >> >> > > >> >> > No plans for that whatsoever. > >> >> > > >> >> >> The proposal if I interpret correctly talks about the following: > >> >> >> > >> >> >> We have an idea that we might want also (on top of the above > >> >> SecretManager > >> >> >> > implementation) define generic Hooks for accessing secrets from > >> those > >> >> >> > services (just generic secrets, not connection, variables). > Simply > >> >> treat > >> >> >> > each of the backends above as another "provider" and create a > >> >> Hook to > >> >> >> > access the service. Such Hook could have just one method: > >> >> >> > def get_secret(self, path_prefix: str, secret_id: str) -> > >> >> Optional[str] > >> >> >> > It would use a connection defined (as usual) in ENV variables or > >> >> database > >> >> >> > of Airflow to authenticate with the secret service and > >> retrieve the > >> >> >> > secrets. > >> >> > > >> >> > OK. maybe confusion is about 'generic' . My "generic" was ("no > >> >> > connections, no variables") - just retrieve "generic" secret. > Separate > >> >> > implementation for Hashicorp Vault, Separate for Secret Manager, > etc. > >> >> > > >> >> >> The connection can be defined in The Secrets backend. To make it > >> >> clearer, > >> >> >> "Vault" in Nathan's case is a "Service" and has nothing to do with > >> >> >> SecretsBackend similar to how PostgresHook or MySQLHook has > nothing > >> >> >> to do > >> >> >> with using Postgres as Airflow MetadataDB Backend. > >> >> >> > >> >> >> Another example is Google KMS, there is already Hook for Google > >> >> KMS ( > >> >> >> > >> >> > >> > https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/hooks/kms.py > >> >> ) > >> >> >> and an Operator can be created. Same can be done for Google > Secrets > >> >> Manager > >> >> >> and Hashicorp Vault, in which cases all of these are "Services". > >> >> > > >> >> > That's exactly what I plan to implement. As explained above - > Operator > >> >> > for secrets makes no sense because it would have to pass the > secrets > >> >> > via xcom :(. I did not even check that we already have KMS hook. > >> I was > >> >> > mostly about Vault and Secret Manager and AWS Secret Manager. > Knowing > >> >> > that we have KMS makes it even easier :). > >> >> > > >> >> >> We could create SecretsHook similar to DbApiHook ( > >> >> >> > >> >> > >> > https://github.com/apache/airflow/blob/master/airflow/hooks/dbapi_hook.py) > >> >> >> if we want to just define the single *get_secret* method you > talked > >> >> about. > >> >> > > >> >> > I don't even plan that in fact, I thought about implementing > several > >> >> > totally independent Hooks for each of the Backend Secrets. > >> >> > > >> >> >> The concept of "Secrets Backend" is to allow Managing of "Secrets > >> >> >> used in > >> >> >> Airflow" (Either to connect to an external system or Variables) in > >> >> actual > >> >> >> Secret Management Tools. > >> >> >> > >> >> > > >> >> > Yeah. I do not - at all - want to mess with that :) > >> >> > > >> >> >> > >> >> >> *Pros:* > >> >> >> > And I > >> >> >> > well imagine this might be actually even more convenient to > >> configure > >> >> >> > connections in the DB and access secrets this way rather than > >> >> >> having to > >> >> >> > configure Secret Backends in Airflow configuration. > >> >> >> > >> >> >> This is exactly where both "Secrets" and the "Service" terms are > >> >> >> mixed I > >> >> >> think. Again echoing what I said above : The concept of "Secrets > >> >> Backend" > >> >> >> is to allow Managing of "Secrets used in Airflow". > >> >> >> The Secrets Backend is so that you don't need to store secrets in > >> >> Airflow > >> >> >> Metadata DB whether they can encrypted or not as there are > >> tools that > >> >> are > >> >> >> specifically designed to handle "Secrets, rotation of secrets > etc". > >> >> Having > >> >> >> the Hook and Operator to talk to the Service should be separate. > >> >> > > >> >> > Full agreement - I do not want to intermix those. It was always > >> >> > thought as per-provider implementation of traditional "Hook". > >> >> > > >> >> >> > >> >> >> * Another benefit of it is that it would allow people still stuck > >> >> on pre > >> >> >> > 1.10.10 to write custom operators that would like to use secret > >> >> backends > >> >> >> > (via backport operators). And still continue doing it in the > future > >> >> >> > (possibly migrating to 2.0/1.10.10+ in cases when there is one > >> secret > >> >> >> > backed only - but continue ot use connections/hooks where some > >> >> specific > >> >> >> > secrets shoudl be kept in different secret backend. > >> >> >> > >> >> >> > >> >> >> What is the objective here: (1) is it to interact with those > Services > >> >> >> (Vault or Secrets Manager etc) or (2) Get Airflow Connections and > >> >> Variables > >> >> >> from different Secrets Backend > >> >> > > >> >> > Just to interact with it - no plans at all to get Airflow > Connections > >> >> > nor Variables. > >> >> > > >> >> >> > >> >> >> Regards, > >> >> >> Kaxil > >> >> >> > >> >> > > >> >> > >> > > >> > > >> > -- > >> > > >> > Jarek Potiuk > >> > > >> > > > > > > -- > > > > Jarek Potiuk > > Polidea <https://www.polidea.com/> | Principal Software Engineer > > > > M: +48 660 796 129 <+48660796129> > > [image: Polidea] <https://www.polidea.com/> > > > -- Jarek Potiuk Polidea <https://www.polidea.com/> | Principal Software Engineer M: +48 660 796 129 <+48660796129> [image: Polidea] <https://www.polidea.com/>