Hi folks,

What is the best method for passing secrets to Spark operations, e.g. for 
encryption, or for salting with a secret before hashing?
I have a few ideas off the top of my head.

The secret's source:
- environment variable
- config property
- remote service accessed through an API.

Passing to the executors:
1. The driver resolves the secret
   a. it passes it to the encryption function as an argument, which ends up as 
an argument to a UDF or gets interpolated into the expression's generated code.
   b. it passes it to the encryption function as a literal expression. For 
security, I can create a SecretLiteral expression that redacts the value from 
the pretty-printed and SQL renderings. Are there any other concerns here?

2. The executors resolve the secret
   a. each executor reads it from an env var/config/service; only the env var 
name/property name/path/URI is passed as part of the plan. I need to cache the 
secret on the executor to avoid a performance hit, especially in the remote 
service case.
   b. Similarly to (1.b), I can create an expression that resolves the secret 
during execution.
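To make (1.b) concrete, here is a rough sketch of the redaction idea in plain
Python (a real implementation would be a Catalyst Expression in Scala; the
class and method names here are made up for illustration):

```python
class SecretLiteral:
    """Hypothetical literal wrapper whose printed forms never show the value."""

    def __init__(self, value: str):
        self._value = value

    def resolve(self) -> str:
        # Used only at evaluation time, never in any rendered form of the plan.
        return self._value

    def __repr__(self) -> str:
        # What would appear in a pretty-printed plan.
        return "SecretLiteral(*********)"

    def sql(self) -> str:
        # What would appear in the generated SQL text.
        return "'*********'"
```

The point is that every string rendering is redacted, while evaluation still
has access to the real value.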
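For (2.a), the caching I have in mind is roughly this (again plain Python, not
Spark APIs; the env-var resolver stands in for whatever config/service lookup
would really be used, and the function name is made up):

```python
import os
from functools import lru_cache


@lru_cache(maxsize=None)
def resolve_secret(env_var_name: str) -> str:
    """Resolve a secret on the executor; only the *name* travels in the plan.

    The lru_cache means the lookup happens once per executor process, which
    matters most if this read a remote service instead of an env var.
    """
    value = os.environ.get(env_var_name)
    if value is None:
        raise KeyError(f"secret {env_var_name!r} is not set on this executor")
    return value
```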

In (1) the secret is passed as part of the plan, so the RPC connections have 
to be encrypted if an attacker could sniff the network for secrets. (1.b) and 
(2.b) are superior for composing with existing expressions, e.g. 
`sha1(concat(colToMask, secretLit("mySecret")))` for masking a column 
deterministically using a cryptographic hash function and a secret salt. (2) 
might involve a more complicated design than (1).
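For clarity, the masking semantics I'm after are just these (a plain-Python
sketch of what `sha1(concat(colToMask, secretLit(...)))` computes per row; the
function name is made up):

```python
import hashlib


def mask_value(value: str, secret_salt: str) -> str:
    """Deterministic masking: same value + same salt -> same 40-char hex digest.

    Mirrors sha1(concat(col, salt)); without the salt, an attacker could
    reverse common values by brute-forcing the unsalted hash.
    """
    return hashlib.sha1((value + secret_salt).encode("utf-8")).hexdigest()
```

Determinism is what lets the masked column still be used for joins and
group-bys, which is why I want the salt available inside the expression tree.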

If you can point me to existing work in this space it would be a great help!

Thanks in advance,
David Szakallas



