Airflow logs are stored on the worker filesystem. When a worker starts, it runs a subprocess that serves logs via Flask: https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L985
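For what it's worth, that log server speaks plain HTTP: when the webserver can't find a task's log file locally, it fetches it from the worker with a simple GET. A rough sketch of what that request looks like is below (the hostname, port, and path are made up for the example; the real URL is built by Airflow's FileTaskHandler from the task instance's recorded hostname, the [celery] worker_log_server_port setting, and the log filename template):

    import requests

    # Illustrative values only; Airflow derives these from the task
    # instance and airflow.cfg at runtime.
    worker_host = "airflow-worker-0"    # hostname recorded on the task instance
    log_server_port = 8793              # [celery] worker_log_server_port default
    log_path = "my_dag/my_task/2018-07-04T00:00:00/1.log"   # {dag_id}/{task_id}/{execution_date}/{try_number}.log

    url = "http://{}:{}/log/{}".format(worker_host, log_server_port, log_path)
    print(requests.get(url, timeout=5).text)   # raw task log, unencrypted on the wire

So if the task logs themselves are the sensitive part, that worker-to-webserver hop is one place you'd want network-level encryption or isolation, on top of whatever you do for the database and broker.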
If you use the remote logging feature, the logs are also copied to S3 when the task finishes (the local file is still written first).

Postgres stores most of what you see in the UI: task and DAG state, user accounts, privileges (in RBAC), variables, connections, etc.

I believe the RabbitMQ messages contain just task names and states, and that the workers fetch most of what they need from the database. But if you intercepted it you could manipulate which tasks are run, so I'd still treat it as sensitive.

On Wed, Jul 4, 2018 at 5:37 PM, Kevin Lam <ke...@fathomhealth.co> wrote:

> Hi,
>
> We run Apache Airflow as a set of k8s deployments inside of a GKE cluster,
> similar to the way specified in Mumoshu's github repo:
> https://github.com/mumoshu/kube-airflow.
>
> We are investigating securing our use of Airflow and are wondering about
> some of Airflow's implementation details. Specifically, we run some tasks
> where the workers have access to sensitive data. Some of the data can make
> its way into the task logs. However, we want to make sure it isn't passed
> around, e.g. to the scheduler/database/message queue, and if it is, it
> should be encrypted in any network traffic (e.g. via mutual TLS).
>
> - Does airflow pass around logs to the postgres db, or rabbitmq?
> - Is the information in postgres mainly operational in nature?
> - Is the information in rabbitmq mainly operational in nature?
> - What about the scheduler?
> - Anything else we're missing?
>
> Any ideas are appreciated!
>
> Thanks in advance!