Also: using the Kubernetes executor combined with some of the things we
discussed greatly enhances the security of Airflow as the environment 
isn’t really shared anymore.

B.

> On 2 Aug 2018, at 19:51, Bolke de Bruin <bdbr...@gmail.com> wrote:
> 
> Hi Dan,
> 
> I discussed this a little bit with one of the security architects here. We
> think that you can have a fair trade-off between security and usability by
> having a kind of manifest with the DAG you are submitting. This manifest can
> then specify what the generated tasks/DAGs are allowed to do and what
> metadata to provide to them. We could also let the scheduler generate hashes
> per generated DAG / task and verify those against an established version
> (1st run?). This limits the attack vector.
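> 
> To make this concrete, the manifest plus hash check could be something like
> this (just a sketch in Python; the field and function names are made up,
> nothing here is implemented):
> 
> import hashlib
> import json
> 
> # Hypothetical manifest shipped alongside the DAG file
> manifest = {
>     "dag_id": "my_dag",
>     "owner": "bolke",
>     "allowed_conn_ids": ["my_db", "my_keytab"],
> }
> 
> def dag_fingerprint(serialized_dag: dict) -> str:
>     # Stable hash of a generated DAG; compare against the version
>     # established on the first run to detect unexpected changes
>     payload = json.dumps(serialized_dag, sort_keys=True).encode()
>     return hashlib.sha256(payload).hexdigest()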
> 
> A DagSerializer would be great, but I think it solves a different issue, and
> the above is somewhat simpler to implement?
> 
> Bolke
> 
>> On 29 Jul 2018, at 23:47, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>> 
>> *Let’s say we trust the owner field of the DAGs I think we could do the
>> following.*
>> *Obviously, the trusting the user part is key here. It is one of the
>> reasons I was suggesting using “airflow submit” to update / add dags in
>> Airflow*
>> 
>> 
>> *This is the hard part about my question.*
>> I think in a true multi-tenant environment we wouldn't be able to trust the
>> user; otherwise we wouldn't necessarily even need a mapping of Airflow DAG
>> users to secrets, because if we trust users to set the correct Airflow user
>> for DAGs, we are basically trusting them with all of the creds the Airflow
>> scheduler can access for all users anyway.
>> 
>> I actually had the same thought as your "airflow submit" a while ago, which
>> I discussed with Alex, basically creating an API for adding DAGs instead of
>> having the Scheduler parse them. FWIW I think it's superior to the git time
>> machine approach because it's a more generic form of "serialization" and is
>> more correct as well because the same DAG file parsed on a given git SHA
>> can produce different DAGs. Let me know what you think, and maybe I can
>> start a more formal design doc if you are on board:
>> 
>> A user or service with an auth token sends an "airflow submit" request to a
>> new kind of Dag Serialization service, along with the serialized DAG
>> objects generated by parsing on the client. It's important that these
>> serialized objects are declarative and not e.g. pickles, so that the
>> scheduler/workers can consume them and reproducibility of the DAGs is
>> guaranteed. The service will then store each generated DAG along with its
>> access based on the provided token, e.g. using Ranger, and the
>> scheduler/workers will use the stored DAGs for scheduling/execution.
>> Operators would be deployed along with the Airflow code, separately from
>> the serialized DAGs.
>> 
>> A serialized DAG would look something like this (basically Luigi-style :)):
>> MyTask1 - BashOperator: {
>>  cmd: "sleep 1"
>>  user: "Foo"
>>  access: "token1", "token2"
>> }
>> 
>> MyDAG: {
>>  MyTask1 >> SomeOtherTask1
>>  MyTask2 >> SomeOtherTask1
>> }
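>> 
>> Rendered as plain JSON (just a sketch, not a proposed final schema), the
>> same DAG might be:
>> 
>> {
>>   "dag_id": "MyDAG",
>>   "tasks": {
>>     "MyTask1": {"operator": "BashOperator", "cmd": "sleep 1",
>>                 "user": "Foo", "access": ["token1", "token2"]}
>>   },
>>   "edges": [["MyTask1", "SomeOtherTask1"], ["MyTask2", "SomeOtherTask1"]]
>> }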
>> 
>> Dynamic DAGs in this case would just consist of a service calling "airflow
>> submit" that does its own form of authentication to get access to some
>> kind of tokens (or basically just forwarding the secrets the users of the
>> dynamic DAG submit).
>> 
>> For the default Airflow implementation you can maybe just have the Dag
>> Serialization server bundled with the Scheduler, with auth turned off, and
>> have it periodically update the Dag Serialization store, which would
>> emulate the current behavior closely.
>> 
>> Pros:
>> 1. Consistency across running task instances in a dagrun/scheduler,
>> reproducibility and auditability of DAGs
>> 2. Users can control when to deploy their DAGs
>> 3. Scheduler runs much faster since it doesn't have to run python files
>> and e.g. make network calls
>> 4. Scaling the scheduler becomes easier because a separate service can be
>> responsible for parsing DAGs, which can be trivially scaled horizontally
>> (clients are doing the parsing)
>> 5. Potentially makes creating ad-hoc DAGs/backfilling/iterating on DAGs
>> easier? e.g. can use the Scheduler itself to schedule backfills with a
>> slightly modified serialized version of a DAG.
>> 
>> Cons:
>> 1. Have to deprecate a lot of popular features, e.g. allowing custom
>> callbacks in operators (e.g. on_failure) and Jinja templates
>> 2. Version compatibility problems, e.g. a user/service client might be
>> serializing arguments for hooks/operators that have been deprecated in
>> newer versions of the hooks, or the serialized DAG schema changes and old
>> DAGs aren't automatically updated. Might want to have some kind of
>> versioning system for serialized DAGs to at least ensure that stored DAGs
>> are valid when the Scheduler/Worker/etc. are upgraded, maybe something
>> similar to thrift/protobuf versioning.
>> 3. Additional complexity - an additional service, logic on
>> workers/scheduler to fetch/cache serialized DAGs efficiently,
>> expiring/archiving old DAG definitions, etc.
>> 
>> 
>> On Sun, Jul 29, 2018 at 3:20 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>> 
>>> Ah gotcha. That’s another issue actually (but related).
>>> 
>>> Let’s say we trust the owner field of the DAGs; then I think we could do
>>> the following. We then have a table (and interface) to tell Airflow what
>>> users have access to what connections. The scheduler can then check if the
>>> task in the dag can access the conn_id it is asking for. Auto-generated
>>> dags still have an owner (or should) and therefore should be fine. Some
>>> integrity checking could/should be added, as we want to be sure that the
>>> task we schedule is the task we launch. So a signature calculated at the
>>> scheduler (or part of the DAG), sent as part of the metadata and checked
>>> by the executor, is probably smart.
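>>> 
>>> As a sketch (Python; hypothetical names, nothing here is implemented),
>>> that integrity check could be an HMAC over the serialized task, keyed
>>> with a secret only the scheduler and executor share:
>>> 
>>> import hashlib
>>> import hmac
>>> import json
>>> 
>>> SHARED_KEY = b"scheduler-executor-secret"  # distributed out of band
>>> 
>>> def sign_task(serialized_task: dict) -> str:
>>>     payload = json.dumps(serialized_task, sort_keys=True).encode()
>>>     return hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
>>> 
>>> def verify_task(serialized_task: dict, signature: str) -> bool:
>>>     # Executor side: refuse to launch if the task changed after scheduling
>>>     return hmac.compare_digest(sign_task(serialized_task), signature)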
>>> 
>>> You can also make this more fancy by integrating with something like
>>> Apache Ranger that allows for policy checking.
>>> 
>>> Obviously, the trusting the user part is key here. It is one of the
>>> reasons I was suggesting using “airflow submit” to update / add dags in
>>> Airflow. We could enforce authentication on the DAG. It was kind of ruled
>>> out in favor of git time machines although these never happened afaik ;-).
>>> 
>>> BTW: I have updated my implementation with protobuf. Metadata is now
>>> available at executor and task.
>>> 
>>> 
>>>> On 29 Jul 2018, at 15:47, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>>>> 
>>>> The concern is how to secure secrets on the scheduler such that only
>>>> certain DAGs can access them, and in the case of files that create DAGs
>>>> dynamically, only some set of DAGs should be able to access these
>>>> secrets.
>>>> 
>>>> e.g. if there is a secret/keytab that can be read by DAG A generated by
>>>> file X, and file X generates DAG B as well, there needs to be a scheme to
>>>> stop the parsing of DAG B on the scheduler from being able to read the
>>>> secret in DAG A.
>>>> 
>>>> Does that make sense?
>>>> 
>>>> On Sun, Jul 29, 2018 at 6:14 AM Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>> 
>>>>> I’m not sure what you mean. The example I created allows for dynamic
>>>>> DAGs, as the scheduler obviously knows about the tasks when they are
>>>>> ready to be scheduled. This isn’t any different from a static DAG or a
>>>>> dynamic one.
>>>>> 
>>>>> For Kerberos it isn’t that special. Basically a keytab is the user’s
>>>>> revocable credentials in a special format. The keytab itself can be
>>>>> protected by a password. So I can imagine that a connection is defined
>>>>> that sets a keytab location and password to access the keytab. The
>>>>> scheduler understands this (or maybe the Connection model) and
>>>>> serializes and sends it to the worker as part of the metadata. The
>>>>> worker then reconstructs the keytab and issues a kinit or supplies it
>>>>> to the other service requiring it (e.g. Spark).
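>>>>> 
>>>>> Roughly, on the worker side (a Python sketch, not the actual
>>>>> implementation; the connection field names are made up):
>>>>> 
>>>>> import base64
>>>>> import subprocess
>>>>> import tempfile
>>>>> 
>>>>> def kinit_from_connection(conn_extra: dict, principal: str):
>>>>>     # The scheduler shipped the keytab bytes base64-encoded in the
>>>>>     # connection metadata
>>>>>     keytab_bytes = base64.b64decode(conn_extra["keytab_b64"])
>>>>>     # Written to a temp file only for illustration; see the /tmp
>>>>>     # caveat below
>>>>>     with tempfile.NamedTemporaryFile(suffix=".keytab") as f:
>>>>>         f.write(keytab_bytes)
>>>>>         f.flush()
>>>>>         subprocess.run(["kinit", "-kt", f.name, principal], check=True)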
>>>>> 
>>>>> * Obviously the worker and scheduler need to communicate over SSL.
>>>>> * There is a challenge at the worker level. Credentials are secured
>>>>> against other users, but are readable by the owning user. So imagine 2
>>>>> DAGs from two different users with different connections, without sudo
>>>>> configured. If they end up at the same worker and DAG 2 is malicious, it
>>>>> could read files and memory created by DAG 1. This is the reason why
>>>>> using environment variables is NOT safe (DAG 2 could read
>>>>> /proc/<pid>/environ). To mitigate this we probably need to PIPE the data
>>>>> to the task’s STDIN (see the sketch after this list). It won’t solve the
>>>>> issue but will make it harder, as now it will only be in memory.
>>>>> * The reconstructed keytab (or the initialized version) can be stored
>>>>> in, most likely, the process-keyring
>>>>> (http://man7.org/linux/man-pages/man7/process-keyring.7.html). As
>>>>> mentioned earlier this poses a challenge for Java applications that
>>>>> cannot read from this location (keytab and ccache). Writing it out to
>>>>> the filesystem then becomes a possibility. This is essentially the same
>>>>> way Spark solves it
>>>>> (https://spark.apache.org/docs/latest/security.html#yarn-mode).
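>>>>> 
>>>>> The STDIN idea from the second bullet could look like this (sketch
>>>>> only, in Python; the real executor code path would differ):
>>>>> 
>>>>> import json
>>>>> import subprocess
>>>>> 
>>>>> def launch_task(cmd: list, connections: dict):
>>>>>     # Pipe credentials via STDIN instead of the environment, so they
>>>>>     # never show up in /proc/<pid>/environ
>>>>>     proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
>>>>>     proc.communicate(json.dumps(connections).encode())
>>>>>     # The task reads and parses its STDIN once at startup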
>>>>> 
>>>>> Why not work on this together? We need it as well. Airflow as it is now
>>>>> we consider the biggest security threat, and it is really hard to
>>>>> secure. The above would definitely be a serious improvement. Another
>>>>> step would be to stop Tasks from accessing the Airflow DB altogether.
>>>>> 
>>>>> Cheers
>>>>> Bolke
>>>>> 
>>>>>> On 29 Jul 2018, at 05:36, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>>>>>> 
>>>>>> This makes sense, and thanks for putting this together. I might pick
>>>>>> this up myself depending on whether we can get the rest of the
>>>>>> multi-tenancy story nailed down, but I still think the tricky part is
>>>>>> figuring out how to allow dynamic DAGs (e.g. DAGs created from rows in
>>>>>> a MySQL table) to work with Kerberos; curious what your thoughts are
>>>>>> there. How would secrets be passed securely in a multi-tenant
>>>>>> Scheduler, starting from parsing the DAGs up to the executor sending
>>>>>> them off?
>>>>>> 
>>>>>> On Sat, Jul 28, 2018 at 5:07 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>>>> 
>>>>>>> Here:
>>>>>>> 
>>>>>>> https://github.com/bolkedebruin/airflow/tree/secure_connections
>>>>>>> 
>>>>>>> is a working rudimentary implementation that allows securing the
>>>>>>> connections (only LocalExecutor at the moment):
>>>>>>> 
>>>>>>> * It enforces the use of “conn_id” instead of the mix that we have now
>>>>>>> * A task using “conn_id” has its connections ‘auto-registered’ (which
>>>>>>> is a noop)
>>>>>>> * The scheduler reads the connection information and serializes it to
>>>>>>> json (which should be a different format, protobuf preferably)
>>>>>>> * The scheduler then sends this info to the executor
>>>>>>> * The executor puts this in the environment of the task (environment
>>>>>>> most likely not secure enough for us)
>>>>>>> * The BaseHook reads out this environment variable and does not need
>>>>>>> to touch the database
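>>>>>>> 
>>>>>>> The BaseHook side is then roughly (a simplified sketch of the idea,
>>>>>>> not the branch’s exact code; the variable name is illustrative):
>>>>>>> 
>>>>>>> import json
>>>>>>> import os
>>>>>>> 
>>>>>>> def get_connection(conn_id: str) -> dict:
>>>>>>>     # Connections serialized by the scheduler, injected by the executor
>>>>>>>     conns = json.loads(os.environ.get("AIRFLOW_CONNECTIONS", "{}"))
>>>>>>>     return conns[conn_id]  # no database access needed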
>>>>>>> 
>>>>>>> The example_http_operator works; I haven’t tested any others. To make
>>>>>>> it work I just adjusted the hook and operator to use “conn_id” instead
>>>>>>> of the non-standard http_conn_id.
>>>>>>> 
>>>>>>> Makes sense?
>>>>>>> 
>>>>>>> B.
>>>>>>> 
>>>>>>> * The BaseHook is adjusted to not connect to the database
>>>>>>>> On 28 Jul 2018, at 17:50, Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Well, I don’t think a hook (or task) should obtain it by itself. It
>>>>>>>> should be supplied. At the moment you start executing the task you
>>>>>>>> cannot trust it anymore (i.e. it is unmanaged / non-Airflow code).
>>>>>>>> 
>>>>>>>> So we could change the BaseHook to understand supplied credentials
>>>>>>>> and populate a hash with “conn_ids”. Hooks normally call
>>>>>>>> BaseHook.get_connection anyway, so it shouldn’t be too hard and
>>>>>>>> should in principle not require changes to the hooks themselves if
>>>>>>>> they are well behaved.
>>>>>>>> 
>>>>>>>> B.
>>>>>>>> 
>>>>>>>>> On 28 Jul 2018, at 17:41, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>>>>>>>>> 
>>>>>>>>> *So basically in the scheduler we parse the dag. Either from the
>>>>>>>>> manifest (new) or from smart parsing (probably harder, maybe some
>>>>>>>>> auto register?) we know what connections and keytabs are available
>>>>>>>>> dag wide or per task.*
>>>>>>>>> This is the hard part that I was curious about: for dynamically
>>>>>>>>> created DAGs, e.g. those generated by reading tasks from a MySQL
>>>>>>>>> database or a JSON file, there isn't a great way to do this.
>>>>>>>>> 
>>>>>>>>> I 100% agree with deprecating the connections table (at least for
>>>>>>>>> the secure option). The main work there is rewriting all hooks to
>>>>>>>>> take credentials from arbitrary data sources by allowing a
>>>>>>>>> customized CredentialsReader class. Although hooks are technically
>>>>>>>>> private, I think a lot of companies depend on them, so the PMC
>>>>>>>>> should probably discuss whether this is an Airflow 2.0 change or
>>>>>>>>> not.
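>>>>>>>>> 
>>>>>>>>> Something like this hypothetical interface, just to illustrate the
>>>>>>>>> idea (all names made up):
>>>>>>>>> 
>>>>>>>>> import abc
>>>>>>>>> import json
>>>>>>>>> import os
>>>>>>>>> 
>>>>>>>>> class CredentialsReader(abc.ABC):
>>>>>>>>>     @abc.abstractmethod
>>>>>>>>>     def get_credentials(self, conn_id: str) -> dict:
>>>>>>>>>         """Fetch credentials for conn_id from an arbitrary backend."""
>>>>>>>>> 
>>>>>>>>> class EnvCredentialsReader(CredentialsReader):
>>>>>>>>>     # Reads connections injected by the executor (see above)
>>>>>>>>>     def get_credentials(self, conn_id: str) -> dict:
>>>>>>>>>         return json.loads(os.environ["AIRFLOW_CONNECTIONS"])[conn_id]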
>>>>>>>>> 
>>>>>>>>> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Sure. In general I consider keytabs a part of connection
>>>>>>>>>> information. Connections should be secured by sending the
>>>>>>>>>> connection information a task needs as part of the information the
>>>>>>>>>> executor gets. A task should then not need access to the connection
>>>>>>>>>> table in Airflow. Keytabs could then be sent as part of the
>>>>>>>>>> connection information (base64 encoded) and set up by the executor
>>>>>>>>>> to be readable only by the task it is launching.
>>>>>>>>>> 
>>>>>>>>>> So basically in the scheduler we parse the dag. Either from the
>>>>>>>>>> manifest (new) or from smart parsing (probably harder, maybe some
>>>>>>>>>> auto register?) we know what connections and keytabs are available
>>>>>>>>>> dag wide or per task.
>>>>>>>>>> 
>>>>>>>>>> The credentials and connection information are then serialized
>>>>>>>>>> into a protobuf message and sent to the executor as part of the
>>>>>>>>>> “queue” action. The worker then deserializes the information and
>>>>>>>>>> makes it securely available to the task (which is quite hard btw).
>>>>>>>>>> 
>>>>>>>>>> On that last bit: making the info securely available might mean
>>>>>>>>>> storing it in the Linux KEYRING (supported by python keyring).
>>>>>>>>>> Keytabs will be tough to do properly due to Java not properly
>>>>>>>>>> supporting the KEYRING, only files, and these are hard to make
>>>>>>>>>> secure (due to the possibility that a process will list all files
>>>>>>>>>> in /tmp and get credentials through that). Maybe storing the keytab
>>>>>>>>>> with a password and having the password in the KEYRING might work.
>>>>>>>>>> Something to find out.
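>>>>>>>>>> 
>>>>>>>>>> With the python keyring library that last idea would be roughly
>>>>>>>>>> (a sketch; assumes a Linux keyring backend is configured, and the
>>>>>>>>>> service/user names are made up):
>>>>>>>>>> 
>>>>>>>>>> import keyring
>>>>>>>>>> 
>>>>>>>>>> # Store the keytab password in the kernel keyring at task setup...
>>>>>>>>>> keyring.set_password("airflow-keytab", "my_dag.my_task", "s3cret")
>>>>>>>>>> # ...and read it back just before decrypting the keytab file
>>>>>>>>>> pw = keyring.get_password("airflow-keytab", "my_dag.my_task")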
>>>>>>>>>> 
>>>>>>>>>> B.
>>>>>>>>>> 
>>>>>>>>>> Sent from my iPad
>>>>>>>>>> 
>>>>>>>>>>> On 27 Jul 2018, at 22:04, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I'm curious if you had any ideas to enable multi-tenancy with
>>>>>>>>>>> respect to Kerberos in Airflow.
>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Cool. The doc will need some refinement as it isn't entirely
>>>>>>>>>>>> accurate. In addition we need to distinguish between Airflow as a
>>>>>>>>>>>> client of kerberized services (this is what is talked about in
>>>>>>>>>>>> the astronomer doc) and kerberizing Airflow itself, which the API
>>>>>>>>>>>> supports.
>>>>>>>>>>>> 
>>>>>>>>>>>> In general, to access kerberized services (airflow as a client)
>>>>>>>>>>>> one needs to start the ticket renewer with a valid keytab. For
>>>>>>>>>>>> the hooks it isn't always required to change the hook to support
>>>>>>>>>>>> it; Hadoop cli tools often just pick it up as their client config
>>>>>>>>>>>> is set to do so. Then there is another class of HTTP-like
>>>>>>>>>>>> services which are accessed by urllib under the hood; these
>>>>>>>>>>>> typically use SPNEGO and often need to be adjusted, as SPNEGO
>>>>>>>>>>>> requires some urllib config. Finally, there are protocols which
>>>>>>>>>>>> use SASL with kerberos, like HDFS (not webhdfs, that uses
>>>>>>>>>>>> SPNEGO). These require per-protocol implementations.
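>>>>>>>>>>>> 
>>>>>>>>>>>> For the SPNEGO class, the hook change is typically small.
>>>>>>>>>>>> Something like this (a sketch using the requests-kerberos
>>>>>>>>>>>> package rather than raw urllib; assumes the renewer keeps a
>>>>>>>>>>>> valid ticket cache, and the URL is made up):
>>>>>>>>>>>> 
>>>>>>>>>>>> import requests
>>>>>>>>>>>> from requests_kerberos import HTTPKerberosAuth
>>>>>>>>>>>> 
>>>>>>>>>>>> # The ticket cache populated by the renewer is picked up
>>>>>>>>>>>> # automatically; no password handling in the hook itself
>>>>>>>>>>>> resp = requests.get("https://service.example.com/api",
>>>>>>>>>>>>                     auth=HTTPKerberosAuth())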
>>>>>>>>>>>> 
>>>>>>>>>>>> Off the top of my head we support kerberos client-side now with:
>>>>>>>>>>>> 
>>>>>>>>>>>> * Spark
>>>>>>>>>>>> * HDFS (snakebite python 2.7, cli, and with the upcoming libhdfs
>>>>>>>>>>>> implementation)
>>>>>>>>>>>> * Hive (not metastore afaik)
>>>>>>>>>>>> 
>>>>>>>>>>>> A few things to remember:
>>>>>>>>>>>> 
>>>>>>>>>>>> * If a job (i.e. a Spark job) will finish later than the maximum
>>>>>>>>>>>> ticket lifetime you probably need to provide a keytab to said
>>>>>>>>>>>> application. Otherwise you will get failures after the expiry.
>>>>>>>>>>>> * A keytab (used by the renewer) is credentials (user and pass),
>>>>>>>>>>>> so jobs are executed under the keytab in use at that moment.
>>>>>>>>>>>> * Securing keytabs in a multi-tenant Airflow is a challenge. This
>>>>>>>>>>>> also goes for securing connections. This we need to fix at some
>>>>>>>>>>>> point. The solution for now seems to be no multi-tenancy.
>>>>>>>>>>>> 
>>>>>>>>>>>> Kerberos seems harder than it is, btw. Still, we are sometimes
>>>>>>>>>>>> moving away from it to OAUTH2-based authentication. This gets us
>>>>>>>>>>>> closer to cloud standards (but we are on-prem).
>>>>>>>>>>>> 
>>>>>>>>>>>> B.
>>>>>>>>>>>> 
>>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hit...@apache.org> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Taylor
>>>>>>>>>>>>> 
>>>>>>>>>>>>> +1 on upstreaming this. It would be great if you can submit a
>>>>>>>>>>>>> pull request to enhance the Apache Airflow docs.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> thanks
>>>>>>>>>>>>> Hitesh
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmis...@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> While we're on the topic, I'd love any feedback from Bolke or
>>>>>>>>>>>>>> others who've used Kerberos with Airflow on this quick guide I
>>>>>>>>>>>>>> put together yesterday. It's similar to what's in the Airflow
>>>>>>>>>>>>>> docs but instead all on one page and slightly expanded.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> (or web version <https://www.astronomer.io/guides/kerberos/>)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> One thing I'd like to add is a minimal example of how to
>>>>>>>>>>>>>> Kerberize a hook.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'd be happy to upstream this as well if it's useful (maybe a
>>>>>>>>>>>>>> Concepts > Additional Functionality > Kerberos page?)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Taylor
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> *Taylor Edmiston*
>>>>>>>>>>>>>> Blog <https://blog.tedmiston.com/> | CV
>>>>>>>>>>>>>> <https://stackoverflow.com/cv/taylor> | LinkedIn
>>>>>>>>>>>>>> <https://www.linkedin.com/in/tedmiston/> | AngelList
>>>>>>>>>>>>>> <https://angel.co/taylor> | Stack Overflow
>>>>>>>>>>>>>> <https://stackoverflow.com/users/149428/taylor-edmiston>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko
>>>>>>>>>>>>>> <fo...@driesprong.frl> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Ry,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> You should ask Bolke de Bruin. He's really experienced with
>>>>>>>>>>>>>>> Kerberos and he also did the implementation for Airflow.
>>>>>>>>>>>>>>> Besides that, he also worked on implementing Kerberos in
>>>>>>>>>>>>>>> Ambari. Just want to let you know.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Cheers, Fokko
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Thu, 26 Jul 2018 at 23:03, Ry Walker <r...@astronomer.io> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi everyone -
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We have several bigCo's who are considering using Airflow
>>>>>>>>>>>>>>>> asking about its support for Kerberos.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> We're going to work on a proof-of-concept next week, and will
>>>>>>>>>>>>>>>> likely record a screencast on it.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> For now, we're looking for any anecdotal information from
>>>>>>>>>>>>>>>> organizations who are using Kerberos with Airflow. If anyone
>>>>>>>>>>>>>>>> would be willing to share their experiences here, or reply to
>>>>>>>>>>>>>>>> me personally, it would be greatly appreciated!
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -Ry
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/> |
>>>>>>>>>>>>>>>> 513.417.2163 | @rywalker <http://twitter.com/rywalker> |
>>>>>>>>>>>>>>>> LinkedIn <http://www.linkedin.com/in/rywalker>
> 
