Hello Airflow Community, 

Due to some unique circumstances we are running all of our backfills through 
Rundeck; it has been working pretty well. However, we're running into some 
issues. 

When a backfill is run via the Airflow CLI, a record for the job is created in 
Airflow's `job` table with a `job_type` of "BackfillJob". We use this record to 
track the status of backfills that have been kicked off. Of particular 
usefulness is the `latest_heartbeat` field, which lets us detect when a 
backfill process dies before completing. Unfortunately, the only descriptive 
information in that job record is the `dag_id` it was run for. If we kick off 
multiple backfills for the same dag_id (but with different dates/regexes), we 
can't tell which record in the job table maps to the particular CLI command we 
ran.
For instance, if we ran these two commands:
        airflow backfill --task_regex task_one --start_date 2020-01-01 --end_date 2020-01-01 foobar
        airflow backfill --task_regex task_two --start_date 2020-02-01 --end_date 2020-02-01 foobar

 and we query the job table for active backfills running for our dag "foobar":
        select * from job
        where job_type = 'BackfillJob'
        and dag_id = 'foobar'
        and state = 'running'
        and latest_heartbeat >= NOW() - interval '3 minutes'

We would get the following results: 
        id  | dag_id | state   | job_type    | start_date                    | end_date | latest_heartbeat             | executor_class | hostname     | unixname
        123 | foobar | running | BackfillJob | 2020-03-31 17:29:04.142715+00 | [null]   | 2020-03-31 21:42:56.47259+00 | CeleryExecutor | 4ff5e55dc67b | airflow
        124 | foobar | running | BackfillJob | 2020-03-31 19:49:08.111486+00 | [null]   | 2020-03-31 21:42:57.63451+00 | CeleryExecutor | 4ff5e55dc67b | airflow

but we have no way of knowing which row maps to which CLI command that was run.
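The only workaround we can think of is to record the wall-clock time just before launching each backfill and match it against `job.start_date`, since each BackfillJob row's start_date should fall shortly after its launch. A rough sketch of that matching logic (the `match_backfills` helper and the row shape are our own invention, not anything Airflow provides):

```python
from datetime import datetime

def match_backfills(launches, job_rows):
    """Pair each recorded launch time with the BackfillJob row whose
    start_date is the earliest one at or after that launch.

    launches: dict mapping a label (e.g. the CLI command) -> launch datetime
    job_rows: list of dicts with at least 'id' and 'start_date' keys
    Returns: dict mapping label -> job id (or None if nothing matched).
    """
    unclaimed = sorted(job_rows, key=lambda r: r["start_date"])
    matched = {}
    # Walk launches oldest-first so each job row is claimed at most once.
    for label, launched_at in sorted(launches.items(), key=lambda kv: kv[1]):
        matched[label] = None
        for row in unclaimed:
            if row["start_date"] >= launched_at:
                matched[label] = row["id"]
                unclaimed.remove(row)
                break
    return matched

# Using the two rows from the query above, with launch times we recorded:
launches = {
    "task_one backfill": datetime(2020, 3, 31, 17, 29, 0),
    "task_two backfill": datetime(2020, 3, 31, 19, 49, 0),
}
rows = [
    {"id": 123, "start_date": datetime(2020, 3, 31, 17, 29, 4)},
    {"id": 124, "start_date": datetime(2020, 3, 31, 19, 49, 8)},
]
# match_backfills(launches, rows)
# -> {"task_one backfill": 123, "task_two backfill": 124}
```

This is obviously fragile if two backfills for the same dag are launched within the same second, which is part of why we'd rather get the job id directly.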
We've pored over the database for anything that might be linked to the job id 
and come up empty. We know the worker itself knows the id, because we traced 
queries being run against the Airflow database and saw this query coming from 
the worker on the host where we ran the CLI command:
        SELECT job.id AS job_id, job.dag_id AS job_dag_id,
        job.state AS job_state, job.job_type AS job_job_type,
        job.start_date AS job_start_date, job.end_date AS job_end_date,
        job.latest_heartbeat AS job_latest_heartbeat,
        job.executor_class AS job_executor_class, job.hostname AS job_hostname,
        job.unixname AS job_unixname
        FROM job
        WHERE job.id = 123
        LIMIT 1

Please help us understand: is it possible to discover the job id for a 
backfill command that was run? 
Or, alternatively, is it possible to capture the job id at the moment we kick 
off a new backfill? 


Thanks for helping,  
Victor Ignatenkov 

