Hi Rafal, In our production, we have a few bulk queries:
1. We have a cron job to query the replica of the airflow db for number of tis in different states, the number of tasks in queued, running, and the number of tasks whose state is changed to 'scheduled', 'running', 'queued' in the last minute window (we have a task_instance trigger table to track all state changes of tis). It then emits metrics that we have alerts on. 2. We have another job to periodically trim historical data in task_instance, log, dag_run, etc tables, if their state is failed/success. This prevents those tables from growing too big to have performance impact. (when we do deletion, we carefully break down the query into smaller chunks to avoid locking the table for too long, which might cause issues) 3. We have a script to easily reset failed tis given a period of time so that when there is an infra failure, we can reset those failed tasks due to infra failure. I hope this can help. As for the Airflow API part, I think it depends on the use cases. If the use cases are for infra teams who manage the airflow clusters, I am leaning towards not having an API. But if it is for other users, it is better to have an API around it as we don't want them to have access to prod db. Thanks, Ping On Thu, May 19, 2022 at 10:31 AM Rafal Biegacz <[email protected]> wrote: > Hi Airflow Community, > > Some heavy users of Airflow rely on the possibility to run bulk sql > queries against Airflow DB. > > My gut feeling is that one of the reason they do that is to implement some > additional visualizations and monitoring for Airflow tasks, DAGs and sense > state of task scheduling. > > For example, I see that users do the following: > > - retrieving all tasks that are in the queued state > - deleting all the tasks that are in the queued state for too long > - query DB to retrieve info about all DAGs that failed in the > specified period of time. > - etc. > > Just wondering if we shouldn't implement support for some such queries via > Airflow API to eliminate a need to query Airflow database directly. > This request might be even more important in the context of AIP-44 > Airflow Internal API > <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API> > which > tries to limit/control direct access to the Airflow Database. > > It would be great if you could share more info about such "bulk" queries > that you do/saw in practice and it would be great to hear from you what you > think about implementing support for some of them in Airflow API (stable or > internal). > > Regards, Rafal. >
