Here:

https://github.com/bolkedebruin/airflow/tree/secure_connections

is a rudimentary but working implementation that allows securing connections 
(LocalExecutor only at the moment):

* It enforces the use of “conn_id” instead of the mix of names we have now
* A task that uses “conn_id” has its connections ‘auto-registered’ (which is 
a no-op)
* The scheduler reads the connection information and serializes it to JSON 
(this should eventually be a different format, preferably protobuf)
* The scheduler then sends this info to the executor
* The executor puts this in the environment of the task (the environment is 
most likely not secure enough for us)
* The BaseHook reads out this environment variable and does not need to touch 
the database
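The steps above can be sketched end to end. All names here are illustrative, not the actual code on the branch:

```python
import json

# Illustrative sketch of the conn_id flow described above; function names
# and the env-var convention are hypothetical, not the branch's actual code.

def serialize_connection(conn):
    """Scheduler side: serialize one connection to JSON
    (protobuf would be the preferred format later)."""
    return json.dumps(conn)

def inject_connection(env, conn_id, conn_json):
    """Executor side: place the serialized connection in the task's
    environment (likely not secure enough for us long-term)."""
    env["AIRFLOW_CONN_" + conn_id.upper()] = conn_json
    return env

def read_connection(environ, conn_id):
    """BaseHook side: read the connection back from the environment
    without touching the database."""
    raw = environ.get("AIRFLOW_CONN_" + conn_id.upper())
    return json.loads(raw) if raw is not None else None
```

A real implementation would swap the JSON for protobuf and hand the blob to the executor as part of the queue call rather than a plain dict.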

The example_http_operator works; I haven't tested any others. To make it work I 
just adjusted the hook and operator to use “conn_id” instead of the 
non-standard http_conn_id.

Makes sense? 

B.

> On 28 Jul 2018, at 17:50, Bolke de Bruin <bdbr...@gmail.com> wrote:
> 
> Well, I don’t think a hook (or task) should obtain it by itself. It should 
> be supplied.
> At the moment you start executing the task you cannot trust it anymore (i.e. 
> it is unmanaged / non-Airflow code).
> 
> So we could change the basehook to understand supplied credentials and 
> populate a hash with “conn_ids”. Hooks normally call BaseHook.get_connection 
> anyway, so it shouldn’t be too hard and should in principle not require 
> changes to the hooks themselves if they are well behaved.
> 
> B.
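The basehook change described in the quoted message could look roughly like this. This is a sketch with hypothetical names, not actual Airflow code:

```python
import json

class BaseHook:
    # Hash of credentials supplied by the executor, keyed by conn_id
    # (hypothetical sketch of the change described above).
    _supplied_connections = {}

    @classmethod
    def supply_connections(cls, serialized):
        """Populate the hash from the info the executor passes in."""
        for conn in json.loads(serialized):
            cls._supplied_connections[conn["conn_id"]] = conn

    @classmethod
    def get_connection(cls, conn_id):
        """Well-behaved hooks already go through this call, so they
        would need no changes themselves."""
        try:
            return cls._supplied_connections[conn_id]
        except KeyError:
            raise KeyError(f"Connection {conn_id!r} was not supplied to this task")
```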
> 
>> On 28 Jul 2018, at 17:41, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>> 
>> *So basically in the scheduler we parse the dag. Either from the manifest
>> (new) or from smart parsing (probably harder, maybe some auto register?) we
>> know what connections and keytabs are available dag wide or per task.*
>> This is the hard part that I was curious about, for dynamically created
>> DAGs, e.g. those generated by reading tasks in a MySQL database or a json
>> file, there isn't a great way to do this.
>> 
>> I 100% agree with deprecating the connections table (at least for the
>> secure option). The main work there is rewriting all hooks to take
>> credentials from arbitrary data sources by allowing a customized
>> CredentialsReader class. Although hooks are technically private, I think a
>> lot of companies depend on them so the PMC should probably discuss if this
>> is an Airflow 2.0 change or not.
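The customized CredentialsReader mentioned above might look like this. The interface is hypothetical, not an existing Airflow class:

```python
import json
from abc import ABC, abstractmethod

class CredentialsReader(ABC):
    """Pluggable source of credentials for hooks (hypothetical interface)."""

    @abstractmethod
    def get_connection(self, conn_id):
        """Return a dict of connection fields for conn_id."""

class EnvCredentialsReader(CredentialsReader):
    """Reads connections the executor placed in the environment,
    so hooks never touch the connections table."""

    def __init__(self, environ):
        self._environ = environ

    def get_connection(self, conn_id):
        return json.loads(self._environ["AIRFLOW_CONN_" + conn_id.upper()])
```

Other readers (a vault-backed one, say) could then be dropped in without rewriting each hook.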
>> 
>> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>> 
>>> Sure. In general I consider keytabs part of connection information.
>>> Connections should be secured by sending the connection information a task
>>> needs as part of the information the executor gets. A task should then not
>>> need access to the connection table in Airflow. Keytabs could then be sent
>>> as part of the connection information (base64 encoded) and set up by the
>>> executor to be readable only by the task it is launching.
>>> 
>>> So basically in the scheduler we parse the dag. Either from the manifest
>>> (new) or from smart parsing (probably harder, maybe some auto register?) we
>>> know what connections and keytabs are available dag wide or per task.
>>> 
>>> The credentials and connection information are then serialized into a
>>> protobuf message and sent to the executor as part of the “queue” action.
>>> The worker then deserializes the information and makes it securely
>>> available to the task (which is quite hard btw).
>>> 
>>> On that last bit: making the info securely available might mean storing it
>>> in the Linux KEYRING (supported by python keyring). Keytabs will be tough
>>> to do properly because Java does not properly support the KEYRING, only
>>> files, and files are hard to make secure (a process could list all files in
>>> /tmp and get credentials that way). Maybe storing the keytab encrypted with
>>> a password and keeping the password in the KEYRING might work. Something to
>>> find out.
>>> 
>>> B.
>>> 
>>> Sent from my iPad
>>> 
>>>> On 27 Jul 2018, at 22:04, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>>>> 
>>>> I'm curious whether you had any ideas to enable multi-tenancy with
>>>> respect to Kerberos in Airflow.
>>>> 
>>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>>> 
>>>>> Cool. The doc will need some refinement as it isn't entirely accurate.
>>>>> In addition we need to distinguish between Airflow as a client of
>>>>> kerberized services (this is what the astronomer doc talks about) and
>>>>> kerberizing Airflow itself, which the API supports.
>>>>> 
>>>>> In general, to access kerberized services (Airflow as a client) one
>>>>> needs to start the ticket renewer with a valid keytab. For the hooks it
>>>>> isn't always required to change the hook to support it: Hadoop CLI tools
>>>>> often just pick it up, as their client config is set to do so. Another
>>>>> class is HTTP-like services accessed by urllib under the hood; these
>>>>> typically use SPNEGO and often need to be adjusted, since SPNEGO requires
>>>>> some urllib config. Finally, there are protocols which use SASL with
>>>>> kerberos, like HDFS (not webhdfs, which uses SPNEGO). These require
>>>>> per-protocol implementations.
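As a concrete client-side example, the ticket renewer is typically driven by an airflow.cfg section along these lines (principal and paths are placeholders to adjust):

```ini
# airflow.cfg -- settings read by the ticket renewer
[kerberos]
ccache = /tmp/airflow_krb5_ccache
principal = airflow@EXAMPLE.COM
reinit_frequency = 3600
kinit_path = kinit
keytab = /etc/airflow/airflow.keytab
```

The renewer itself is then started with `airflow kerberos`, which keeps the ticket cache fresh from the keytab.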
>>>>> 
>>>>> Off the top of my head we support kerberos client side now with:
>>>>> 
>>>>> * Spark
>>>>> * HDFS (snakebite on Python 2.7, the CLI, and the upcoming libhdfs
>>>>> implementation)
>>>>> * Hive (not metastore afaik)
>>>>> 
>>>>> A few things to remember:
>>>>> 
>>>>> * If a job (i.e. a Spark job) will finish later than the maximum ticket
>>>>> lifetime you probably need to provide a keytab to said application.
>>>>> Otherwise you will get failures after the expiry.
>>>>> * A keytab (used by the renewer) is credentials (user and pass), so jobs
>>>>> are executed under the keytab in use at that moment.
>>>>> * Securing keytabs in multi-tenant Airflow is a challenge. This also goes
>>>>> for securing connections. We need to fix this at some point; the solution
>>>>> for now seems to be no multi-tenancy.
>>>>> 
>>>>> Kerberos seems harder than it is, btw. Still, we are sometimes moving
>>>>> away from it to OAUTH2-based authentication. This gets us closer to
>>>>> cloud standards (but we are on prem).
>>>>> 
>>>>> B.
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hit...@apache.org> wrote:
>>>>>> 
>>>>>> Hi Taylor
>>>>>> 
>>>>>> +1 on upstreaming this. It would be great if you can submit a pull
>>>>>> request to enhance the Apache Airflow docs.
>>>>>> 
>>>>>> thanks
>>>>>> Hitesh
>>>>>> 
>>>>>> 
>>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmis...@gmail.com> wrote:
>>>>>>> 
>>>>>>> While we're on the topic, I'd love any feedback from Bolke or others
>>>>>>> who've used Kerberos with Airflow on this quick guide I put together
>>>>>>> yesterday. It's similar to what's in the Airflow docs but all on one
>>>>>>> page and slightly expanded.
>>>>>>> 
>>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
>>>>>>> (or web version <https://www.astronomer.io/guides/kerberos/>)
>>>>>>> 
>>>>>>> One thing I'd like to add is a minimal example of how to Kerberize a
>>>>> hook.
>>>>>>> 
>>>>>>> I'd be happy to upstream this as well if it's useful (maybe a
>>>>>>> Concepts > Additional Functionality > Kerberos page?)
>>>>>>> 
>>>>>>> Best,
>>>>>>> Taylor
>>>>>>> 
>>>>>>> 
>>>>>>> *Taylor Edmiston*
>>>>>>> Blog <https://blog.tedmiston.com/> | CV
>>>>>>> <https://stackoverflow.com/cv/taylor> | LinkedIn
>>>>>>> <https://www.linkedin.com/in/tedmiston/> | AngelList
>>>>>>> <https://angel.co/taylor> | Stack Overflow
>>>>>>> <https://stackoverflow.com/users/149428/taylor-edmiston>
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko
>>>>>>> <fo...@driesprong.frl> wrote:
>>>>>>> 
>>>>>>>> Hi Ry,
>>>>>>>> 
>>>>>>>> You should ask Bolke de Bruin. He's really experienced with Kerberos
>>>>>>>> and he also did the Kerberos implementation for Airflow. Besides that,
>>>>>>>> he also worked on implementing Kerberos in Ambari. Just wanted to let
>>>>>>>> you know.
>>>>>>>> 
>>>>>>>> Cheers, Fokko
>>>>>>>> 
>>>>>>>> On Thu 26 Jul 2018 at 23:03, Ry Walker <r...@astronomer.io> wrote:
>>>>>>>> 
>>>>>>>>> Hi everyone -
>>>>>>>>> 
>>>>>>>>> We have several bigCos who are considering Airflow asking about its
>>>>>>>>> support for Kerberos.
>>>>>>>>> 
>>>>>>>>> We're going to work on a proof-of-concept next week, and will likely
>>>>>>>>> record a screencast on it.
>>>>>>>>> 
>>>>>>>>> For now, we're looking for any anecdotal information from
>>>>>>>>> organizations who are using Kerberos with Airflow. If anyone would be
>>>>>>>>> willing to share their experiences here, or reply to me personally,
>>>>>>>>> it would be greatly appreciated!
>>>>>>>>> 
>>>>>>>>> -Ry
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> 
>>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/> |
>>>>>>>> 513.417.2163 |
>>>>>>>>> @rywalker <http://twitter.com/rywalker> | LinkedIn
>>>>>>>>> <http://www.linkedin.com/in/rywalker>
>>>>>>> 
>>>>> 
>>> 
> 
