I’m not sure what you mean. The example I created allows for dynamic DAGs, as the scheduler obviously knows about the tasks by the time they are ready to be scheduled. In that respect a dynamic DAG is no different from a static one.
Kerberos isn’t that special. Basically, a keytab holds the user’s revocable credentials in a special format. The keytab itself can be protected by a password. So I can imagine that a connection is defined that sets a keytab location and a password to access the keytab. The scheduler understands this (or maybe the Connection model does) and serializes and sends it to the worker as part of the metadata. The worker then reconstructs the keytab and issues a kinit or supplies it to the other service requiring it (e.g. Spark).

* Obviously the worker and scheduler need to communicate over SSL.
* There is a challenge at the worker level. Credentials are secured against other users, but are readable by the owning user. So imagine two DAGs from two different users with different connections, without sudo configured. If they end up on the same worker and DAG 2 is malicious, it could read files and memory created by DAG 1. This is the reason why using environment variables is NOT safe (DAG 2 could read /proc/<pid>/environ). To mitigate this we probably need to pipe the data to the task’s STDIN. It won’t solve the issue but will make it harder, as the data will then only be in memory.
* The reconstructed keytab (or the initialized version) can be stored in, most likely, the process-keyring (http://man7.org/linux/man-pages/man7/process-keyring.7.html). As mentioned earlier this poses a challenge for Java applications, which cannot read from this location (keytab and ccache). Writing it out to the filesystem then becomes a possibility. This is essentially how Spark solves it (https://spark.apache.org/docs/latest/security.html#yarn-mode).

Why not work on this together? We need it as well. We consider Airflow as it is now the biggest security threat, and it is really hard to secure. The above would definitely be a serious improvement.
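A minimal sketch of the STDIN idea from the worker-level point above. It assumes a hypothetical `run_task_with_connections` helper and a made-up `my_http` connection; neither is part of Airflow. The parent serializes the connection info to JSON and writes it to the child's STDIN, and the child confirms the secret does not appear in its own environment:

```python
import json
import subprocess
import sys

# Hypothetical task-side code: it reads the connection info from STDIN
# and checks that the secret is absent from its own environment.
CHILD_CODE = """
import json, os, sys
conns = json.load(sys.stdin)
secret = conns["my_http"]["password"]
in_env = any(secret in value for value in os.environ.values())
print(conns["my_http"]["host"], in_env)
"""


def run_task_with_connections(connections: dict) -> str:
    """Start a child process and pass the connections over STDIN only."""
    payload = json.dumps(connections)
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=payload,        # written to the child's STDIN pipe
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Illustrative values only, not real credentials.
output = run_task_with_connections(
    {"my_http": {"host": "example.org", "password": "not-a-real-password"}}
)
print(output)
```

As noted above, this only keeps the data out of /proc/<pid>/environ; a process running as the same user could still inspect the task's memory.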
Another step would be to stop Tasks from accessing the Airflow DB altogether.

Cheers
Bolke

> On 29 Jul 2018, at 05:36, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>
> This makes sense, and thanks for putting this together. I might pick this
> up myself depending on if we can get the rest of the multi-tenancy story
> nailed down, but I still think the tricky part is figuring out how to allow
> dynamic DAGs (e.g. DAGs created from rows in a MySQL table) to work with
> Kerberos, curious what your thoughts are there. How would secrets be passed
> securely in a multi-tenant Scheduler starting from parsing the DAGs up to
> the executor sending them off?
>
> On Sat, Jul 28, 2018 at 5:07 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>
>> Here:
>>
>> https://github.com/bolkedebruin/airflow/tree/secure_connections
>>
>> Is a working rudimentary implementation that allows securing the
>> connections (only LocalExecutor at the moment)
>>
>> * It enforces the use of “conn_id” instead of the mix that we have now
>> * A task if using “conn_id” has ‘auto-registered’ (which is a noop) its
>> connections
>> * The scheduler reads the connection information and serializes it to
>> JSON (which should be a different format, protobuf preferably)
>> * The scheduler then sends this info to the executor
>> * The executor puts this in the environment of the task (environment most
>> likely not secure enough for us)
>> * The BaseHook reads out this environment variable and does not need to
>> touch the database
>>
>> The example_http_operator works, I haven’t tested any other. To make it
>> work I just adjusted the hook and operator to use “conn_id” instead
>> of the non-standard http_conn_id.
>>
>> Makes sense?
>>
>> B.
>>
>> * The BaseHook is adjusted to not connect to the database
>>> On 28 Jul 2018, at 17:50, Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>
>>> Well, I don’t think a hook (or task) should obtain it by itself. It
>>> should be supplied.
>>> At the moment you start executing the task you cannot trust it anymore
>>> (i.e. it is unmanaged / non-Airflow code).
>>>
>>> So we could change the BaseHook to understand supplied credentials and
>>> populate a hash with “conn_ids”. Hooks normally call
>>> BaseHook.get_connection anyway, so it shouldn’t be too hard and should in
>>> principle not require changes to the hooks themselves if they are well
>>> behaved.
>>>
>>> B.
>>>
>>>> On 28 Jul 2018, at 17:41, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>>>>
>>>> *So basically in the scheduler we parse the dag. Either from the manifest
>>>> (new) or from smart parsing (probably harder, maybe some auto register?) we
>>>> know what connections and keytabs are available dag wide or per task.*
>>>> This is the hard part that I was curious about, for dynamically created
>>>> DAGs, e.g. those generated by reading tasks in a MySQL database or a JSON
>>>> file, there isn't a great way to do this.
>>>>
>>>> I 100% agree with deprecating the connections table (at least for the
>>>> secure option). The main work there is rewriting all hooks to take
>>>> credentials from arbitrary data sources by allowing a customized
>>>> CredentialsReader class. Although hooks are technically private, I think a
>>>> lot of companies depend on them so the PMC should probably discuss if this
>>>> is an Airflow 2.0 change or not.
>>>>
>>>> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>>
>>>>> Sure. In general I consider keytabs as a part of connection information.
>>>>> Connections should be secured by sending the connection information a task
>>>>> needs as part of the information the executor gets. A task should then not
>>>>> need access to the connection table in Airflow. Keytabs could then be sent
>>>>> as part of the connection information (base64 encoded) and set up by the
>>>>> executor (this key) to be read-only to the task it is launching.
>>>>>
>>>>> So basically in the scheduler we parse the dag. Either from the manifest
>>>>> (new) or from smart parsing (probably harder, maybe some auto register?) we
>>>>> know what connections and keytabs are available dag wide or per task.
>>>>>
>>>>> The credentials and connection information then are serialized into a
>>>>> protobuf message and sent to the executor as part of the “queue” action.
>>>>> The worker then deserializes the information and makes it securely
>>>>> available to the task (which is quite hard btw).
>>>>>
>>>>> On that last bit, making the info securely available might mean storing it
>>>>> in the Linux KEYRING (supported by python keyring). Keytabs will be tough to
>>>>> do properly because Java does not properly support KEYRING, only files, and
>>>>> these are hard to make secure (due to the possibility a process will list
>>>>> all files in /tmp and get credentials through that). Maybe storing the
>>>>> keytab with a password and having the password in the KEYRING might work.
>>>>> Something to find out.
>>>>>
>>>>> B.
>>>>>
>>>>> Sent from my iPad
>>>>>
>>>>>> On 27 Jul 2018, at 22:04, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>>>>>>
>>>>>> I'm curious if you had any ideas on how to enable multi-tenancy
>>>>>> with respect to Kerberos in Airflow.
>>>>>>
>>>>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>>>>>
>>>>>>> Cool. The doc will need some refinement as it isn't entirely accurate. In
>>>>>>> addition we need to separate between Airflow as a client of kerberized
>>>>>>> services (this is what is talked about in the astronomer doc) vs
>>>>>>> kerberizing airflow itself, which the API supports.
>>>>>>>
>>>>>>> In general to access kerberized services (airflow as a client) one needs
>>>>>>> to start the ticket renewer with a valid keytab. For the hooks it isn't
>>>>>>> always required to change the hook to support it. Hadoop cli tools often
>>>>>>> just pick it up as their client config is set to do so. Then there is
>>>>>>> another class of HTTP-like services which are accessed by urllib under the
>>>>>>> hood; these typically use SPNEGO. These often need to be adjusted as it
>>>>>>> requires some urllib config. Finally, there are protocols which use SASL
>>>>>>> with kerberos, like HDFS (not webhdfs, that uses SPNEGO). These require
>>>>>>> per-protocol implementations.
>>>>>>>
>>>>>>> Off the top of my head we support kerberos client side now with:
>>>>>>>
>>>>>>> * Spark
>>>>>>> * HDFS (snakebite python 2.7, cli and with the upcoming libhdfs
>>>>>>> implementation)
>>>>>>> * Hive (not metastore afaik)
>>>>>>>
>>>>>>> A few things to remember:
>>>>>>>
>>>>>>> * If a job (e.g. a Spark job) will finish later than the maximum ticket
>>>>>>> lifetime you probably need to provide a keytab to said application.
>>>>>>> Otherwise you will get failures after the expiry.
>>>>>>> * A keytab (used by the renewer) is a set of credentials (user and pass),
>>>>>>> so jobs are executed under the keytab in use at that moment
>>>>>>> * Securing the keytab in multi-tenant airflow is a challenge. This also
>>>>>>> goes for securing connections. This we need to fix at some point. Solution
>>>>>>> for now seems to be no multi tenancy.
>>>>>>>
>>>>>>> Kerberos seems harder than it is btw. Still, we are sometimes moving away
>>>>>>> from it to OAUTH2 based authentication. This gets us closer to cloud
>>>>>>> standards (but we are on prem)
>>>>>>>
>>>>>>> B.
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hit...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Hi Taylor
>>>>>>>>
>>>>>>>> +1 on upstreaming this. It would be great if you can submit a pull
>>>>>>>> request to enhance the apache airflow docs.
>>>>>>>>
>>>>>>>> thanks
>>>>>>>> Hitesh
>>>>>>>>
>>>>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmis...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> While we're on the topic, I'd love any feedback from Bolke or others
>>>>>>>>> who've used Kerberos with Airflow on this quick guide I put together
>>>>>>>>> yesterday. It's similar to what's in the Airflow docs but instead all on
>>>>>>>>> one page and slightly expanded.
>>>>>>>>>
>>>>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
>>>>>>>>> (or web version <https://www.astronomer.io/guides/kerberos/>)
>>>>>>>>>
>>>>>>>>> One thing I'd like to add is a minimal example of how to Kerberize a
>>>>>>>>> hook.
>>>>>>>>>
>>>>>>>>> I'd be happy to upstream this as well if it's useful (maybe a
>>>>>>>>> Concepts > Additional Functionality > Kerberos page?)
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Taylor
>>>>>>>>>
>>>>>>>>> *Taylor Edmiston*
>>>>>>>>> Blog <https://blog.tedmiston.com/> | CV
>>>>>>>>> <https://stackoverflow.com/cv/taylor> | LinkedIn
>>>>>>>>> <https://www.linkedin.com/in/tedmiston/> | AngelList
>>>>>>>>> <https://angel.co/taylor> | Stack Overflow
>>>>>>>>> <https://stackoverflow.com/users/149428/taylor-edmiston>
>>>>>>>>>
>>>>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko <fo...@driesprong.frl> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ry,
>>>>>>>>>>
>>>>>>>>>> You should ask Bolke de Bruin. He's really experienced with Kerberos
>>>>>>>>>> and he also did the implementation for Airflow. Besides that, he also
>>>>>>>>>> worked on implementing Kerberos in Ambari. Just want to let you know.
>>>>>>>>>>
>>>>>>>>>> Cheers, Fokko
>>>>>>>>>>
>>>>>>>>>> On Thu, 26 Jul 2018 at 23:03, Ry Walker <r...@astronomer.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone -
>>>>>>>>>>>
>>>>>>>>>>> We have several bigCo's who are considering using Airflow asking
>>>>>>>>>>> about its support for Kerberos.
>>>>>>>>>>>
>>>>>>>>>>> We're going to work on a proof-of-concept next week, will likely
>>>>>>>>>>> record a screencast on it.
>>>>>>>>>>>
>>>>>>>>>>> For now, we're looking for any anecdotal information from
>>>>>>>>>>> organizations who are using Kerberos with Airflow, if anyone would be
>>>>>>>>>>> willing to share their experiences here, or reply to me personally, it
>>>>>>>>>>> would be greatly appreciated!
>>>>>>>>>>>
>>>>>>>>>>> -Ry
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/> |
>>>>>>>>>>> 513.417.2163 | @rywalker <http://twitter.com/rywalker> | LinkedIn
>>>>>>>>>>> <http://www.linkedin.com/in/rywalker>