Airflow logs are stored on the worker filesystem. When a worker starts, it runs a subprocess that serves logs via Flask: https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L985
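For what it's worth, that log server speaks plain HTTP: when the webserver can't find a task's log file locally, it fetches it from the worker with a simple GET. A rough sketch of what that request looks like is below (the hostname, port, and path are made up for the example; the real URL is built by Airflow's FileTaskHandler from the task instance's recorded hostname, the [celery] worker_log_server_port setting, and the log filename template):

    import requests

    # Illustrative values only; Airflow derives these from the task
    # instance and airflow.cfg at runtime.
    worker_host = "airflow-worker-0"    # hostname recorded on the task instance
    log_server_port = 8793              # [celery] worker_log_server_port default
    log_path = "my_dag/my_task/2018-07-04T00:00:00/1.log"   # {dag_id}/{task_id}/{execution_date}/{try_number}.log

    url = "http://{}:{}/log/{}".format(worker_host, log_server_port, log_path)
    print(requests.get(url, timeout=5).text)   # raw task log, unencrypted on the wire

So if the task logs themselves are the sensitive part, that worker-to-webserver hop is one place you'd want network-level encryption or isolation, on top of whatever you do for the database and broker.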
If you use the remote logging feature, the logs are also copied to S3 when the task finishes (the local file is still written first).

Postgres stores most of what you see in the UI: task and DAG state, user accounts, privileges (in RBAC), variables, connections, etc.

I believe the RabbitMQ messages contain just task names and states, and that the workers fetch most of what they need from the database. But if you intercepted it you could manipulate which tasks are run, so I'd still treat it as sensitive.

On Wed, Jul 4, 2018 at 5:37 PM, Kevin Lam <ke...@fathomhealth.co> wrote:

> Hi,
>
> We run Apache Airflow as a set of k8s deployments inside of a GKE cluster,
> similar to the way specified in Mumoshu's github repo:
> https://github.com/mumoshu/kube-airflow.
>
> We are investigating securing our use of Airflow and are wondering about
> some of Airflow's implementation details. Specifically, we run some tasks
> where the workers have access to sensitive data. Some of the data can make
> its way into the task logs. However, we want to make sure it isn't passed
> around, e.g. to the scheduler/database/message queue, and if it is, it
> should be encrypted in any network traffic (e.g. via mutual TLS).
>
> - Does airflow pass around logs to the postgres db, or rabbitmq?
> - Is the information in postgres mainly operational in nature?
> - Is the information in rabbitmq mainly operational in nature?
> - What about the scheduler?
> - Anything else we're missing?
>
> Any ideas are appreciated!
>
> Thanks in advance!