I'm very intrigued, and I'm curious how this would work in a bit more
detail, especially for dynamically created DAGs (for example, how would
static manifests map to DAGs that are generated from rows in a MySQL
table?). You could of course have something like regexes in your manifest
file, e.g. some_dag_framework_dag_*, but then how would you ensure that
other users did not create DAGs matching this regex?
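One hypothetical way to avoid that collision (not something this thread settles on, and all names below are invented for illustration): treat manifest dag_id prefixes as first-come, first-served namespaces, so a second owner cannot claim a pattern that overlaps an existing one.

```python
# Sketch: prefixes claimed in manifests act as exclusive namespaces.
# A claim is rejected if it overlaps a prefix owned by someone else.
class ManifestRegistry:
    def __init__(self):
        self._claims = {}  # prefix -> owner

    def claim(self, prefix, owner):
        for existing, existing_owner in self._claims.items():
            overlap = existing.startswith(prefix) or prefix.startswith(existing)
            if overlap and existing_owner != owner:
                raise PermissionError(
                    f"prefix {prefix!r} overlaps {existing!r} "
                    f"owned by {existing_owner}")
        self._claims[prefix] = owner

    def owner_of(self, dag_id):
        for prefix, owner in self._claims.items():
            if dag_id.startswith(prefix):
                return owner
        return None

registry = ManifestRegistry()
registry.claim("some_dag_framework_dag_", "framework_svc")
print(registry.owner_of("some_dag_framework_dag_123"))  # framework_svc
try:
    registry.claim("some_dag_", "another_user")  # overlaps -> rejected
except PermissionError as e:
    print("rejected:", e)
```

This still assumes the registry itself is trusted, which is exactly the open question above.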

On Thu, Aug 2, 2018 at 1:51 PM Bolke de Bruin <bdbr...@gmail.com> wrote:

> Hi Dan,
>
> I discussed this a little bit with one of the security architects here. We
> think that you can have a fair trade-off between security and usability by
> having a kind of manifest with the DAG you are submitting. This manifest
> can then specify what the generated tasks/DAGs are allowed to do and what
> metadata to provide to them. We could also let the scheduler generate
> hashes per generated DAG/task and verify those against an established
> version (1st run?). This limits the attack vector.
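The per-DAG hash check described above could be sketched roughly as follows (a trust-on-first-use scheme; the function and store names are invented, not Airflow APIs):

```python
import hashlib

def dag_fingerprint(serialized_dag: str) -> str:
    # Stable digest of the generated DAG's structure.
    return hashlib.sha256(serialized_dag.encode("utf-8")).hexdigest()

established = {}  # dag_id -> fingerprint, persisted by the scheduler

def verify(dag_id: str, serialized_dag: str) -> bool:
    fp = dag_fingerprint(serialized_dag)
    if dag_id not in established:   # 1st run: trust on first use
        established[dag_id] = fp
        return True
    return established[dag_id] == fp

print(verify("etl_1", "task_a >> task_b"))     # True (first run, recorded)
print(verify("etl_1", "task_a >> task_b"))     # True (unchanged)
print(verify("etl_1", "task_a >> task_evil"))  # False (structure changed)
```

The open design question is what happens on a legitimate change: presumably the owner re-registers the DAG, which is where the manifest's authentication would come in.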
>
> A DagSerializer would be great, but I think it solves a different issue
> and the above
> is somewhat simpler to implement?
>
> Bolke
>
> > On 29 Jul 2018, at 23:47, Dan Davydov <ddavy...@twitter.com.INVALID>
> wrote:
> >
> > *Let’s say we trust the owner field of the DAGs I think we could do the
> > following.*
> > *Obviously, the trusting the user part is key here. It is one of the
> > reasons I was suggesting using “airflow submit” to update / add dags in
> > Airflow*
> >
> >
> > *This is the hard part about my question.*
> > I think in a true multi-tenant environment we wouldn't be able to trust
> > the user; otherwise we wouldn't necessarily even need a mapping of
> > Airflow DAG users to secrets, because if we trust users to set the
> > correct Airflow user for DAGs, we are basically trusting them with all
> > of the creds the Airflow scheduler can access for all users anyway.
> >
> > I actually had the same thought as your "airflow submit" a while ago,
> > which I discussed with Alex: basically creating an API for adding DAGs
> > instead of having the scheduler parse them. FWIW I think it's superior
> > to the git time machine approach because it's a more generic form of
> > "serialization", and is more correct as well because the same DAG file
> > parsed at a given git SHA can produce different DAGs. Let me know what
> > you think, and maybe I can start a more formal design doc if you are on
> > board:
> >
> > A user or service with an auth token sends an "airflow submit" request
> > to a new kind of DAG Serialization service, along with the serialized
> > DAG objects generated by parsing on the client. It's important that
> > these serialized objects are declarative and not e.g. pickles, so that
> > the scheduler/workers can consume them and reproducibility of the DAGs
> > is guaranteed. The service will then store each generated DAG along with
> > its access based on the provided token (e.g. using Ranger), and the
> > scheduler/workers will use the stored DAGs for scheduling/execution.
> > Operators would be deployed along with the Airflow code, separately from
> > the serialized DAGs.
> >
> > A serialized DAG would look something like this (basically Luigi-style :)):
> > MyTask - BashOperator: {
> >  cmd: "sleep 1"
> >  user: "Foo"
> >  access: "token1", "token2"
> > }
> >
> > MyDAG: {
> >  MyTask1 >> SomeOtherTask1
> >  MyTask2 >> SomeOtherTask1
> > }
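A declarative format like the example above could be serialized deterministically, which is also what makes the hash-verification idea feasible. A rough sketch, with all field names invented (this is not the schema the thread decides on):

```python
import json

def serialize_dag(dag_id, tasks, edges):
    """Serialize a DAG declaratively (no pickles, no executable code).

    tasks: {task_id: (operator_name, params_dict)}
    edges: [(upstream_task_id, downstream_task_id)]
    """
    return json.dumps({
        "dag_id": dag_id,
        "tasks": {
            tid: {"operator": op, "params": params}
            for tid, (op, params) in tasks.items()
        },
        "edges": [{"upstream": u, "downstream": d} for u, d in edges],
    }, sort_keys=True)  # sorted keys -> byte-stable output, hashable

blob = serialize_dag(
    "MyDAG",
    {"MyTask": ("BashOperator",
                {"cmd": "sleep 1", "user": "Foo",
                 "access": ["token1", "token2"]})},
    [("MyTask", "SomeOtherTask")],
)
# The scheduler/worker can reconstruct the DAG from this blob without
# executing any user-supplied code.
parsed = json.loads(blob)
print(parsed["tasks"]["MyTask"]["operator"])  # BashOperator
```

Because the output is byte-stable, two submissions of the same DAG compare equal, which is what a fingerprint check needs.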
> >
> > Dynamic DAGs in this case would just consist of a service calling
> > "airflow submit" that does its own form of authentication to get access
> > to some kind of tokens (or basically just forwarding the secrets the
> > users of the dynamic DAG submit).
> >
> > For the default Airflow implementation you could maybe just have the DAG
> > Serialization server bundled with the scheduler, with auth turned off,
> > and have it periodically update the DAG Serialization store, which would
> > closely emulate the current behavior.
> >
> > Pros:
> > 1. Consistency across running task instances in a dagrun/scheduler, and
> > reproducibility and auditability of DAGs
> > 2. Users can control when to deploy their DAGs
> > 3. The scheduler runs much faster since it doesn't have to run Python
> > files and e.g. make network calls
> > 4. Scaling the scheduler becomes easier because a different service can
> > be responsible for parsing DAGs, which can be trivially scaled
> > horizontally (clients are doing the parsing)
> > 5. Potentially makes creating ad-hoc DAGs/backfilling/iterating on DAGs
> > easier? e.g. we could use the scheduler itself to schedule backfills
> > with a slightly modified serialized version of a DAG.
> >
> > Cons:
> > 1. Have to deprecate a lot of popular features, e.g. allowing custom
> > callbacks in operators (e.g. on_failure), and jinja templates
> > 2. Version-compatibility problems, e.g. a user/service client might be
> > serializing arguments for hooks/operators that have been deprecated in
> > newer versions of the hooks, or the serialized DAG schema changes and
> > old DAGs aren't automatically updated. Might want to have some kind of
> > versioning system for serialized DAGs to at least ensure that stored
> > DAGs are valid when the Scheduler/Worker/etc. are upgraded, maybe
> > something similar to thrift/protobuf versioning.
> > 3. Additional complexity: an additional service, logic on
> > workers/scheduler to fetch/cache serialized DAGs efficiently,
> > expiring/archiving old DAG definitions, etc.
> >
> >
> > On Sun, Jul 29, 2018 at 3:20 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
> >
> >> Ah gotcha. That's another issue actually (but related).
> >>
> >> Let's say we trust the owner field of the DAGs; then I think we could
> >> do the following. We have a table (and interface) to tell Airflow what
> >> users have access to what connections. The scheduler can then check
> >> whether the task in the DAG can access the conn_id it is asking for.
> >> Auto-generated DAGs still have an owner (or should) and therefore
> >> should be fine. Some integrity checking could/should be added, as we
> >> want to be sure that the task we schedule is the task we launch. So a
> >> signature calculated at the scheduler (or as part of the DAG), sent as
> >> part of the metadata and checked by the executor, is probably smart.
> >>
> >> You can also make this fancier by integrating with something like
> >> Apache Ranger that allows for policy checking.
> >>
> >> Obviously, trusting the user is the key part here. It is one of the
> >> reasons I was suggesting using "airflow submit" to update / add DAGs
> >> in Airflow. We could enforce authentication on the DAG. It was kind of
> >> ruled out in favor of git time machines, although these never happened
> >> afaik ;-).
> >>
> >> BTW: I have updated my implementation with protobuf. Metadata is now
> >> available at the executor and task.
> >>
> >>
> >>> On 29 Jul 2018, at 15:47, Dan Davydov <ddavy...@twitter.com.INVALID>
> >> wrote:
> >>>
> >>> The concern is how to secure secrets on the scheduler such that only
> >>> certain DAGs can access them; and in the case of files that create
> >>> DAGs dynamically, only some set of DAGs should be able to access
> >>> these secrets.
> >>>
> >>> e.g. if there is a secret/keytab that can be read by DAG A generated
> >>> by file X, and file X generates DAG B as well, there needs to be a
> >>> scheme to stop the parsing of DAG B on the scheduler from being able
> >>> to read the secret in DAG A.
> >>>
> >>> Does that make sense?
> >>>
> >>> On Sun, Jul 29, 2018 at 6:14 AM Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>
> >>>> I'm not sure what you mean. The example I created allows for dynamic
> >>>> DAGs, as the scheduler obviously knows about the tasks when they are
> >>>> ready to be scheduled. This isn't any different for a static DAG or
> >>>> a dynamic one.
> >>>>
> >>>> For Kerberos it isn't that special. Basically a keytab is the user's
> >>>> revocable credentials in a special format. The keytab itself can be
> >>>> protected by a password, so I can imagine a connection being defined
> >>>> that sets a keytab location and a password to access the keytab. The
> >>>> scheduler understands this (or maybe the Connection model does) and
> >>>> serializes and sends it to the worker as part of the metadata. The
> >>>> worker then reconstructs the keytab and issues a kinit, or supplies
> >>>> it to the other service requiring it (e.g. Spark).
> >>>>
> >>>> * Obviously the worker and scheduler need to communicate over SSL.
> >>>> * There is a challenge at the worker level. Credentials are secured
> >>>> against other users, but are readable by the owning user. So imagine
> >>>> two DAGs from two different users with different connections,
> >>>> without sudo configured. If they end up at the same worker and DAG 2
> >>>> is malicious, it could read files and memory created by DAG 1. This
> >>>> is the reason why using environment variables is NOT safe (DAG 2
> >>>> could read /proc/<pid>/environ). To mitigate this we probably need
> >>>> to PIPE the data to the task's STDIN. It won't solve the issue
> >>>> entirely, but it will make it harder, as the data will now only be
> >>>> in memory.
> >>>> * The reconstructed keytab (or the initialized version) can be
> >>>> stored in, most likely, the process keyring
> >>>> (http://man7.org/linux/man-pages/man7/process-keyring.7.html). As
> >>>> mentioned earlier this poses a challenge for Java applications,
> >>>> which cannot read from this location (keytab and ccache). Writing it
> >>>> out to the filesystem then becomes a possibility. This is
> >>>> essentially how Spark solves it
> >>>> (https://spark.apache.org/docs/latest/security.html#yarn-mode).
> >>>>
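The STDIN approach from the bullets above could look roughly like this (a sketch, not Airflow code: the executor side writes the credentials once to the child's stdin, so they never appear in its environment or argv):

```python
import json
import subprocess
import sys

# Credentials the "executor" wants to hand to the "task".
secrets = {"conn_id": "my_http", "login": "svc", "password": "s3cret"}

# The child reads the credentials once from STDIN. Nothing lands in
# /proc/<pid>/environ or in the process argument list, so another local
# user cannot recover them that way (they do remain in memory).
child = (
    "import sys, json; "
    "creds = json.load(sys.stdin); "
    "print(creds['conn_id'])"
)
proc = subprocess.Popen(
    [sys.executable, "-c", child],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
out, _ = proc.communicate(json.dumps(secrets))
print(out.strip())  # my_http
```

As the thread notes, this raises the bar rather than solving same-user isolation entirely.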
> >>>> Why not work on this together? We need it as well. Airflow as it is
> >>>> now is what we consider the biggest security threat, and it is
> >>>> really hard to secure. The above would definitely be a serious
> >>>> improvement. Another step would be to stop tasks from accessing the
> >>>> Airflow DB altogether.
> >>>>
> >>>> Cheers
> >>>> Bolke
> >>>>
> >>>>> On 29 Jul 2018, at 05:36, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
> >>>>>
> >>>>> This makes sense, and thanks for putting this together. I might
> >>>>> pick this up myself depending on whether we can get the rest of the
> >>>>> multi-tenancy story nailed down, but I still think the tricky part
> >>>>> is figuring out how to allow dynamic DAGs (e.g. DAGs created from
> >>>>> rows in a MySQL table) to work with Kerberos; curious what your
> >>>>> thoughts are there. How would secrets be passed securely in a
> >>>>> multi-tenant scheduler, starting from parsing the DAGs up to the
> >>>>> executor sending them off?
> >>>>>
> >>>>> On Sat, Jul 28, 2018 at 5:07 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>
> >>>>>> Here:
> >>>>>>
> >>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections
> >>>>>>
> >>>>>> Is a working rudimentary implementation that allows securing the
> >>>>>> connections (only LocalExecutor at the moment):
> >>>>>>
> >>>>>> * It enforces the use of "conn_id" instead of the mix that we have
> >>>>>> now
> >>>>>> * A task using "conn_id" has 'auto-registered' (which is a noop)
> >>>>>> its connections
> >>>>>> * The scheduler reads the connection information and serializes it
> >>>>>> to JSON (which should be a different format, protobuf preferably)
> >>>>>> * The scheduler then sends this info to the executor
> >>>>>> * The executor puts this in the environment of the task
> >>>>>> (environment most likely not secure enough for us)
> >>>>>> * The BaseHook reads this environment variable and does not need
> >>>>>> to touch the database
> >>>>>>
> >>>>>> The example_http_operator works; I haven't tested any others. To
> >>>>>> make it work I just adjusted the hook and operator to use
> >>>>>> "conn_id" instead of the non-standard http_conn_id.
> >>>>>>
> >>>>>> Makes sense?
> >>>>>>
> >>>>>> B.
> >>>>>>
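The prototype flow in the bullets above (scheduler serializes connection info, executor puts it in the task environment, the hook reads it back instead of querying the metadata DB) could be sketched like this. The variable and function names are invented, and the thread itself notes that the environment is probably not secure enough:

```python
import json
import os

# Invented name; stands in for whatever key the executor would use.
ENV_KEY = "AIRFLOW_TASK_CONNECTIONS"

def executor_prepare_env(connections):
    # Executor side: serialize the connections the task is allowed to use
    # into the task's environment.
    env = os.environ.copy()
    env[ENV_KEY] = json.dumps({c["conn_id"]: c for c in connections})
    return env

def hook_get_connection(conn_id, env):
    # Hook side: resolve a conn_id from the injected environment instead
    # of touching the Airflow metadata database.
    return json.loads(env[ENV_KEY])[conn_id]

env = executor_prepare_env(
    [{"conn_id": "http_default", "host": "example.com", "login": "svc"}])
conn = hook_get_connection("http_default", env)
print(conn["host"])  # example.com
```

Swapping the environment for a STDIN pipe or the process keyring, as discussed later in the thread, changes only the transport; the hook-side lookup stays the same.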
> >>>>>>> On 28 Jul 2018, at 17:50, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Well, I don't think a hook (or task) should obtain it by itself.
> >>>>>>> It should be supplied. At the moment you start executing the task
> >>>>>>> you cannot trust it anymore (i.e. it is unmanaged / non-Airflow
> >>>>>>> code).
> >>>>>>>
> >>>>>>> So we could change the BaseHook to understand supplied
> >>>>>>> credentials and populate a hash keyed by "conn_id". Hooks
> >>>>>>> normally call BaseHook.get_connection anyway, so it shouldn't be
> >>>>>>> too hard and should in principle not require changes to the hooks
> >>>>>>> themselves if they are well behaved.
> >>>>>>>
> >>>>>>> B.
> >>>>>>>
> >>>>>>>> On 28 Jul 2018, at 17:41, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
> >>>>>>>>
> >>>>>>>> *So basically in the scheduler we parse the dag. Either from the
> >>>>>>>> manifest (new) or from smart parsing (probably harder, maybe
> >>>>>>>> some auto register?) we know what connections and keytabs are
> >>>>>>>> available dag wide or per task.*
> >>>>>>>> This is the hard part that I was curious about: for dynamically
> >>>>>>>> created DAGs, e.g. those generated by reading tasks in a MySQL
> >>>>>>>> database or a JSON file, there isn't a great way to do this.
> >>>>>>>>
> >>>>>>>> I 100% agree with deprecating the connections table (at least
> >>>>>>>> for the secure option). The main work there is rewriting all
> >>>>>>>> hooks to take credentials from arbitrary data sources by
> >>>>>>>> allowing a customized CredentialsReader class. Although hooks
> >>>>>>>> are technically private, I think a lot of companies depend on
> >>>>>>>> them, so the PMC should probably discuss whether this is an
> >>>>>>>> Airflow 2.0 change or not.
> >>>>>>>>
> >>>>>>>> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Sure. In general I consider keytabs part of the connection
> >>>>>>>>> information. Connections should be secured by sending the
> >>>>>>>>> connection information a task needs as part of the information
> >>>>>>>>> the executor gets. A task should then not need access to the
> >>>>>>>>> connection table in Airflow. Keytabs could then be sent as part
> >>>>>>>>> of the connection information (base64 encoded) and set up by
> >>>>>>>>> the executor (this key) to be readable only by the task it is
> >>>>>>>>> launching.
> >>>>>>>>>
> >>>>>>>>> So basically in the scheduler we parse the DAG. Either from the
> >>>>>>>>> manifest (new) or from smart parsing (probably harder, maybe
> >>>>>>>>> some auto register?) we know what connections and keytabs are
> >>>>>>>>> available DAG wide or per task.
> >>>>>>>>>
> >>>>>>>>> The credentials and connection information are then serialized
> >>>>>>>>> into a protobuf message and sent to the executor as part of the
> >>>>>>>>> "queue" action. The worker then deserializes the information
> >>>>>>>>> and makes it securely available to the task (which is quite
> >>>>>>>>> hard btw).
> >>>>>>>>>
> >>>>>>>>> On that last bit: making the info securely available might mean
> >>>>>>>>> storing it in the Linux KEYRING (supported by python keyring).
> >>>>>>>>> Keytabs will be tough to do properly, due to Java not properly
> >>>>>>>>> supporting the KEYRING and only files, and files are hard to
> >>>>>>>>> make secure (due to the possibility that a process will list
> >>>>>>>>> all files in /tmp and get credentials through that). Maybe
> >>>>>>>>> storing the keytab with a password and having the password in
> >>>>>>>>> the KEYRING might work. Something to find out.
> >>>>>>>>>
> >>>>>>>>> B.
> >>>>>>>>>
> >>>>>>>>> Sent from my iPad
> >>>>>>>>>
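The "keytab file plus password in the KEYRING" idea above could be sketched like this. Everything here is illustrative: a plain dict stands in for the kernel keyring, and the file is simply made owner-readable only, for Java processes that can only consume file-based keytabs:

```python
import os
import tempfile

# Stand-in for the Linux process keyring / python-keyring backend.
process_keyring = {}

def materialize_keytab(keytab_bytes: bytes, password: str) -> str:
    """Write the keytab owner-readable only; keep its password off disk."""
    fd, path = tempfile.mkstemp(suffix=".keytab")
    os.fchmod(fd, 0o600)          # owner-only, set before any data lands
    with os.fdopen(fd, "wb") as f:
        f.write(keytab_bytes)
    # The unlock password never touches the filesystem.
    process_keyring["keytab_password"] = password
    return path

path = materialize_keytab(b"\x05\x02fake-keytab", "hunter2")
print(oct(os.stat(path).st_mode & 0o777))  # 0o600
os.unlink(path)
```

This only narrows the window discussed in the thread (same-user processes can still read the file); protecting the keytab contents with the password itself would need real crypto on top.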
> >>>>>>>>> On 27 Jul 2018 at 22:04, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I'm curious whether you have any ideas for enabling
> >>>>>>>>>> multi-tenancy with respect to Kerberos in Airflow.
> >>>>>>>>>>
> >>>>>>>>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Cool. The doc will need some refinement as it isn't entirely
> >>>>>>>>>>> accurate. In addition we need to distinguish between Airflow
> >>>>>>>>>>> as a client of kerberized services (this is what is talked
> >>>>>>>>>>> about in the Astronomer doc) vs kerberizing Airflow itself,
> >>>>>>>>>>> which the API supports.
> >>>>>>>>>>>
> >>>>>>>>>>> In general, to access kerberized services (Airflow as a
> >>>>>>>>>>> client) one needs to start the ticket renewer with a valid
> >>>>>>>>>>> keytab. For the hooks it isn't always required to change the
> >>>>>>>>>>> hook to support it. Hadoop CLI tools often just pick it up,
> >>>>>>>>>>> as their client config is set to do so. Then there is another
> >>>>>>>>>>> class of HTTP-like services which are accessed by urllib
> >>>>>>>>>>> under the hood; these typically use SPNEGO. These often need
> >>>>>>>>>>> to be adjusted, as it requires some urllib config. Finally,
> >>>>>>>>>>> there are protocols which use SASL with Kerberos, like HDFS
> >>>>>>>>>>> (not WebHDFS, which uses SPNEGO). These require per-protocol
> >>>>>>>>>>> implementations.
> >>>>>>>>>>>
> >>>>>>>>>>> Off the top of my head, we support Kerberos client side now
> >>>>>>>>>>> with:
> >>>>>>>>>>>
> >>>>>>>>>>> * Spark
> >>>>>>>>>>> * HDFS (snakebite on Python 2.7, the CLI, and the upcoming
> >>>>>>>>>>> libhdfs implementation)
> >>>>>>>>>>> * Hive (not the metastore afaik)
> >>>>>>>>>>> A few things to remember:
> >>>>>>>>>>>
> >>>>>>>>>>> * If a job (e.g. a Spark job) will finish later than the
> >>>>>>>>>>> maximum ticket lifetime, you probably need to provide a
> >>>>>>>>>>> keytab to said application. Otherwise you will get failures
> >>>>>>>>>>> after the expiry.
> >>>>>>>>>>> * A keytab (used by the renewer) is credentials (user and
> >>>>>>>>>>> pass), so jobs are executed under the keytab in use at that
> >>>>>>>>>>> moment.
> >>>>>>>>>>> * Securing keytabs in multi-tenant Airflow is a challenge.
> >>>>>>>>>>> This also goes for securing connections. This we need to fix
> >>>>>>>>>>> at some point. The solution for now seems to be no
> >>>>>>>>>>> multi-tenancy.
> >>>>>>>>>>>
> >>>>>>>>>>> Kerberos seems harder than it is, btw. Still, we are
> >>>>>>>>>>> sometimes moving away from it to OAUTH2-based
> >>>>>>>>>>> authentication. This gets us closer to cloud standards (but
> >>>>>>>>>>> we are on prem).
> >>>>>>>>>>>
> >>>>>>>>>>> B.
> >>>>>>>>>>>
> >>>>>>>>>>> Sent from my iPhone
> >>>>>>>>>>>
> >>>>>>>>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hit...@apache.org> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Taylor
> >>>>>>>>>>>>
> >>>>>>>>>>>> +1 on upstreaming this. It would be great if you can submit a
> >> pull
> >>>>>>>>>>> request
> >>>>>>>>>>>> to enhance the apache airflow docs.
> >>>>>>>>>>>>
> >>>>>>>>>>>> thanks
> >>>>>>>>>>>> Hitesh
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmis...@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> While we're on the topic, I'd love any feedback from Bolke
> >>>>>>>>>>>>> or others who've used Kerberos with Airflow on this quick
> >>>>>>>>>>>>> guide I put together yesterday. It's similar to what's in
> >>>>>>>>>>>>> the Airflow docs, but all on one page and slightly
> >>>>>>>>>>>>> expanded.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
> >>>>>>>>>>>>> (or the web version <https://www.astronomer.io/guides/kerberos/>)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> One thing I'd like to add is a minimal example of how to
> >>>>>>>>>>>>> Kerberize a hook.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'd be happy to upstream this as well if it's useful (maybe
> >>>>>>>>>>>>> a Concepts > Additional Functionality > Kerberos page?)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Taylor
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Taylor Edmiston*
> >>>>>>>>>>>>> Blog <https://blog.tedmiston.com/> | CV
> >>>>>>>>>>>>> <https://stackoverflow.com/cv/taylor> | LinkedIn
> >>>>>>>>>>>>> <https://www.linkedin.com/in/tedmiston/> | AngelList
> >>>>>>>>>>>>> <https://angel.co/taylor> | Stack Overflow
> >>>>>>>>>>>>> <https://stackoverflow.com/users/149428/taylor-edmiston>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko
> >>>>>>>>>>>>> <fo...@driesprong.frl> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Ry,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> You should ask Bolke de Bruin. He's really experienced
> >>>>>>>>>>>>>> with Kerberos and he also did the implementation for
> >>>>>>>>>>>>>> Airflow. Besides that, he also worked on implementing
> >>>>>>>>>>>>>> Kerberos in Ambari. Just wanted to let you know.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers, Fokko
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, 26 Jul 2018 at 23:03, Ry Walker <r...@astronomer.io> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi everyone -
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We have several bigCos who are considering using Airflow
> >>>>>>>>>>>>>>> asking about its support for Kerberos.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> We're going to work on a proof-of-concept next week, and
> >>>>>>>>>>>>>>> will likely record a screencast on it.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> For now, we're looking for any anecdotal information
> >>>>>>>>>>>>>>> from organizations who are using Kerberos with Airflow.
> >>>>>>>>>>>>>>> If anyone would be willing to share their experiences
> >>>>>>>>>>>>>>> here, or reply to me personally, it would be greatly
> >>>>>>>>>>>>>>> appreciated!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Ry
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/>
> >>>>>>>>>>>>>>> | 513.417.2163 | @rywalker <http://twitter.com/rywalker>
> >>>>>>>>>>>>>>> | LinkedIn <http://www.linkedin.com/in/rywalker>
>
>
