This makes sense, and thanks for putting this together. I might pick this up myself depending on whether we can get the rest of the multi-tenancy story nailed down, but I still think the tricky part is figuring out how to allow dynamic DAGs (e.g. DAGs created from rows in a MySQL table) to work with Kerberos; I'm curious what your thoughts are there. How would secrets be passed securely in a multi-tenant scheduler, from parsing the DAGs through to the executor sending them off?

On Sat, Jul 28, 2018 at 5:07 PM Bolke de Bruin <bdbr...@gmail.com> wrote:

> Here:
>
> https://github.com/bolkedebruin/airflow/tree/secure_connections
>
> is a working, rudimentary implementation that allows securing the
> connections (only LocalExecutor at the moment):
>
> * It enforces the use of “conn_id” instead of the mix that we have now
> * A task, if using “conn_id”, has ‘auto-registered’ (which is a noop) its
>   connections
> * The scheduler reads the connection information and serializes it to
>   JSON (which should be a different format, protobuf preferably)
> * The scheduler then sends this info to the executor
> * The executor puts this in the environment of the task (the environment
>   is most likely not secure enough for us)
> * The BaseHook reads out this environment variable and does not need to
>   touch the database
>
> The example_http_operator works; I haven't tested any other. To make it
> work I just adjusted the hook and operator to use “conn_id” instead
> of the non-standard http_conn_id.
>
> Makes sense?
>
> B.
>
> * The BaseHook is adjusted to not connect to the database
>
> On 28 Jul 2018, at 17:50, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >
> > Well, I don’t think a hook (or task) should obtain it by itself. It
> > should be supplied. At the moment you start executing the task you
> > cannot trust it anymore (i.e. it is unmanaged / non-Airflow code).
> >
> > So we could change the BaseHook to understand supplied credentials and
> > populate a hash with “conn_ids”. Hooks normally call
> > BaseHook.get_connection anyway, so it shouldn't be too hard and should
> > in principle not require changes to the hooks themselves if they are
> > well behaved.
> >
> > B.
> >
> >> On 28 Jul 2018, at 17:41, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
> >>
> >> *So basically in the scheduler we parse the dag. Either from the
> >> manifest (new) or from smart parsing (probably harder, maybe some auto
> >> register?) we know what connections and keytabs are available dag wide
> >> or per task.*
> >> This is the hard part that I was curious about: for dynamically created
> >> DAGs, e.g. those generated by reading tasks in a MySQL database or a
> >> JSON file, there isn't a great way to do this.
> >>
> >> I 100% agree with deprecating the connections table (at least for the
> >> secure option). The main work there is rewriting all hooks to take
> >> credentials from arbitrary data sources by allowing a customized
> >> CredentialsReader class. Although hooks are technically private, I
> >> think a lot of companies depend on them, so the PMC should probably
> >> discuss whether this is an Airflow 2.0 change or not.
> >>
> >> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>
> >>> Sure. In general I consider keytabs as a part of connection
> >>> information. Connections should be secured by sending the connection
> >>> information a task needs as part of the information the executor gets.
> >>> A task should then not need access to the connection table in Airflow.
> >>> Keytabs could then be sent as part of the connection information
> >>> (base64 encoded) and set up by the executor (this key) to be read-only
> >>> to the task it is launching.
> >>>
> >>> So basically in the scheduler we parse the dag. Either from the
> >>> manifest (new) or from smart parsing (probably harder, maybe some auto
> >>> register?) we know what connections and keytabs are available dag wide
> >>> or per task.
> >>>
> >>> The credentials and connection information are then serialized into a
> >>> protobuf message and sent to the executor as part of the “queue”
> >>> action. The worker then deserializes the information and makes it
> >>> securely available to the task (which is quite hard btw).
> >>>
> >>> On that last bit, making the info securely available might mean storing
> >>> it in the Linux KEYRING (supported by python keyring). Keytabs will be
> >>> tough to do properly because Java does not properly support KEYRING,
> >>> only files, and these are hard to make secure (due to the possibility
> >>> that a process will list all files in /tmp and get credentials through
> >>> that). Maybe storing the keytab with a password and having the password
> >>> in the KEYRING might work. Something to find out.
> >>>
> >>> B.
> >>>
> >>> Sent from my iPad
> >>>
> >>>> On 27 Jul 2018, at 22:04, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
> >>>>
> >>>> I'm curious if you had any ideas on how to enable multi-tenancy
> >>>> with respect to Kerberos in Airflow.
> >>>>
> >>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>
> >>>>> Cool. The doc will need some refinement as it isn't entirely
> >>>>> accurate. In addition we need to separate Airflow as a client of
> >>>>> kerberized services (this is what is talked about in the astronomer
> >>>>> doc) from kerberizing Airflow itself, which the API supports.
> >>>>>
> >>>>> In general, to access kerberized services (Airflow as a client) one
> >>>>> needs to start the ticket renewer with a valid keytab. For the hooks
> >>>>> it isn't always required to change the hook to support it. Hadoop cli
> >>>>> tools often just pick it up as their client config is set to do so.
> >>>>> Then another class is there for HTTP-like services which are accessed
> >>>>> by urllib under the hood; these typically use SPNEGO. These often
> >>>>> need to be adjusted as it requires some urllib config. Finally, there
> >>>>> are protocols which use SASL with Kerberos, like HDFS (not webhdfs,
> >>>>> that uses SPNEGO). These require per-protocol implementations.
> >>>>>
> >>>>> Off the top of my head we support Kerberos client side now with:
> >>>>>
> >>>>> * Spark
> >>>>> * HDFS (snakebite python 2.7, cli and with the upcoming libhdfs
> >>>>>   implementation)
> >>>>> * Hive (not metastore afaik)
> >>>>>
> >>>>> Three things to remember:
> >>>>>
> >>>>> * If a job (e.g. a Spark job) will finish later than the maximum
> >>>>>   ticket lifetime you probably need to provide a keytab to said
> >>>>>   application. Otherwise you will get failures after the expiry.
> >>>>> * A keytab (used by the renewer) is a set of credentials (user and
> >>>>>   pass), so jobs are executed under the keytab in use at that moment
> >>>>> * Securing the keytab in multi-tenant Airflow is a challenge. This
> >>>>>   also goes for securing connections. This we need to fix at some
> >>>>>   point. The solution for now seems to be no multi-tenancy.
> >>>>>
> >>>>> Kerberos seems harder than it is btw. Still, we are sometimes moving
> >>>>> away from it to OAuth2-based authentication. This gets us closer to
> >>>>> cloud standards (but we are on prem).
> >>>>>
> >>>>> B.
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hit...@apache.org> wrote:
> >>>>>>
> >>>>>> Hi Taylor
> >>>>>>
> >>>>>> +1 on upstreaming this. It would be great if you can submit a pull
> >>>>>> request to enhance the Apache Airflow docs.
> >>>>>>
> >>>>>> thanks
> >>>>>> Hitesh
> >>>>>>
> >>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmis...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> While we're on the topic, I'd love any feedback from Bolke or
> >>>>>>> others who've used Kerberos with Airflow on this quick guide I put
> >>>>>>> together yesterday. It's similar to what's in the Airflow docs but
> >>>>>>> instead all on one page and slightly expanded.
> >>>>>>>
> >>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
> >>>>>>> (or web version <https://www.astronomer.io/guides/kerberos/>)
> >>>>>>>
> >>>>>>> One thing I'd like to add is a minimal example of how to Kerberize
> >>>>>>> a hook.
> >>>>>>>
> >>>>>>> I'd be happy to upstream this as well if it's useful (maybe a
> >>>>>>> Concepts > Additional Functionality > Kerberos page?)
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Taylor
> >>>>>>>
> >>>>>>> *Taylor Edmiston*
> >>>>>>> Blog <https://blog.tedmiston.com/> | CV
> >>>>>>> <https://stackoverflow.com/cv/taylor> | LinkedIn
> >>>>>>> <https://www.linkedin.com/in/tedmiston/> | AngelList
> >>>>>>> <https://angel.co/taylor> | Stack Overflow
> >>>>>>> <https://stackoverflow.com/users/149428/taylor-edmiston>
> >>>>>>>
> >>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko <fo...@driesprong.frl> wrote:
> >>>>>>>
> >>>>>>>> Hi Ry,
> >>>>>>>>
> >>>>>>>> You should ask Bolke de Bruin. He's really experienced with
> >>>>>>>> Kerberos and he also did the implementation for Airflow. Besides
> >>>>>>>> that, he also worked on implementing Kerberos in Ambari. Just want
> >>>>>>>> to let you know.
> >>>>>>>>
> >>>>>>>> Cheers, Fokko
> >>>>>>>>
> >>>>>>>> On Thu, 26 Jul 2018 at 23:03, Ry Walker <r...@astronomer.io> wrote:
> >>>>>>>>
> >>>>>>>>> Hi everyone -
> >>>>>>>>>
> >>>>>>>>> We have several bigCo's who are considering using Airflow asking
> >>>>>>>>> about its support for Kerberos.
> >>>>>>>>>
> >>>>>>>>> We're going to work on a proof-of-concept next week, and will
> >>>>>>>>> likely record a screencast on it.
> >>>>>>>>>
> >>>>>>>>> For now, we're looking for any anecdotal information from
> >>>>>>>>> organizations who are using Kerberos with Airflow. If anyone
> >>>>>>>>> would be willing to share their experiences here, or reply to me
> >>>>>>>>> personally, it would be greatly appreciated!
> >>>>>>>>>
> >>>>>>>>> -Ry
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>>
> >>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/> |
> >>>>>>>>> 513.417.2163 | @rywalker <http://twitter.com/rywalker> | LinkedIn
> >>>>>>>>> <http://www.linkedin.com/in/rywalker>
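
For anyone following the thread, the handoff Bolke describes (scheduler serializes the connections a task registered, executor places them in the task's environment, the hook resolves "conn_id" from there instead of the database) might look roughly like the sketch below. This is not the code in the secure_connections branch: the environment variable name, the three function names, and the connection dictionaries are all hypothetical, and plain JSON in the environment is used only because the thread mentions it as the current, admittedly not secure enough, approach.

```python
import json
import os

# Hypothetical variable name; the thread does not say what the branch uses.
ENV_VAR = "AIRFLOW_SUPPLIED_CONNECTIONS"


def serialize_connections(connections, conn_ids):
    """Scheduler side: keep only the connections the task registered."""
    selected = {cid: connections[cid] for cid in conn_ids}
    return json.dumps(selected)


def build_task_env(payload, base_env=None):
    """Executor side: copy the environment and add the serialized payload."""
    env = dict(os.environ if base_env is None else base_env)
    env[ENV_VAR] = payload
    return env


def get_connection(conn_id, environ=None):
    """Hook side: resolve conn_id from the environment, never the database."""
    environ = os.environ if environ is None else environ
    supplied = json.loads(environ.get(ENV_VAR, "{}"))
    if conn_id not in supplied:
        raise KeyError(f"connection {conn_id!r} was not supplied to this task")
    return supplied[conn_id]


# Made-up connection data, standing in for what the scheduler would read.
all_connections = {
    "http_default": {"host": "https://example.com", "login": "svc", "password": "s3cret"},
    "mysql_default": {"host": "db.internal", "login": "root", "password": "hunter2"},
}

# The task only registered "http_default", so only that one is shipped.
payload = serialize_connections(all_connections, ["http_default"])
task_env = build_task_env(payload, base_env={})
conn = get_connection("http_default", task_env)
print(conn["host"])
```

The point of filtering in `serialize_connections` is that a task only ever sees the connections it registered; asking for any other "conn_id" fails inside the task rather than falling back to the connections table. It does nothing for the dynamic-DAG case raised at the top of this message, where the set of "conn_ids" is not known until the generating code has run.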