Thanks for the input, this is helpful. To add to the list, there's some complexity around concurrency management and multiple executors: I just hit this thing where backfill doesn't check DAG-level concurrency, fires up 32 tasks, and `airlfow run` double-checks DAG-level concurrency limit and exits. Right after backfill reschedules right away and so on, burning a bunch of CPU doing nothing. In this specific case it seems like `airflow run` should skip that specific check when in the context of a backfill.
Max On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <bdbr...@gmail.com> wrote: > Thinking out loud here, because it is a while back that I did work on > backfills. There were some real issues with backfills: > > 1. Tasks were running in non deterministic order ending up in regular > deadlocks > 2. Didn’t create dag runs, making behavior inconsistent. Max dag runs > could not be enforced. Ui could really display it, lots of minor other > issues because of it. > 3. Behavior was different from the scheduler, while subdagoperators > particularly make use of backfills at the moment. > > I think with 3 the behavior you are observing crept in. And given 3 I > would argue a consistent behavior between the scheduler and the backfill > mechanism is still paramount. Thus we should explicitly clear tasks from > failed if we want to rerun them. This at least until we move the > subdagoperator out of backfill and into the scheduler (which is actually > not too hard). Also we need those command line options anyway. > > Bolke > > Verstuurd vanaf mijn iPad > > > Op 6 jun. 2018 om 01:27 heeft Scott Halgrim > > <scott.halg...@zapier.com.INVALID> > het volgende geschreven: > > > > The request was for opposition, but I’d like to weigh in on the side of > “it’s a better behavior [to have failed tasks re-run when cleared in a > backfill" > >> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin < > maximebeauche...@gmail.com>, wrote: > >> @Jeremiah Lowin <jlo...@gmail.com> & @Bolke de Bruin <bdbr...@gmail.com> > I > >> think you may have some context on why this may have changed at some > point. > >> I'm assuming that when DagRun handling was added to the backfill logic, > the > >> behavior just happened to change to what it is now. > >> > >> Any opposition in moving back towards re-running failed tasks when > starting > >> a backfill? I think it's a better behavior, though it's a change in > >> behavior that we should mention in UPDATE.md. > >> > >> One of our goals is to make sure that a failed or killed backfill can be > >> restarted and just seamlessly pick up where it left off. > >> > >> Max > >> > >>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <fengta...@gmail.com> wrote: > >>> > >>> After discussing with Max, we think it would be great if `airflow > backfill` > >>> could be able to auto pick up and rerun those failed tasks. Currently, > it > >>> will throw exceptions( > >>> > >>> > https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489 > >>> ) > >>> without rerunning the failed tasks. > >>> > >>> But since it broke some of the previous assumptions for backfill, we > would > >>> like to get some feedback and see if anyone has any concerns(pr could > be > >>> found at https://github.com/apache/incubator-airflow/pull/3464/files). > >>> > >>> Thanks, > >>> -Tao > >>> > >>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin < > >>> maximebeauche...@gmail.com> wrote: > >>> > >>>> So I'm running a backfill for what feels like the first time in years > >>> using > >>>> a simple `airflow backfill --local` commands. > >>>> > >>>> First I start getting a ton of `logging.info` of each tasks that > cannot > >>> be > >>>> started just yet at every tick flooding my terminal with the keyword > >>>> `FAILED` in it, looking like a million of lines like this one: > >>>> > >>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies not met > >>> for > >>>> <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00 [scheduled]>, > >>>> dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' re > >>>> quires all upstream tasks to have succeeded, but found 1 > non-success(es). > >>>> upstream_tasks_state={'successes': 0L, 'failed': 0L, > 'upstream_failed': > >>>> 0L, > >>>> 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other_task_id'] > >>>> > >>>> Good thing I triggered 1 month and not 2 years like I actually need, > just > >>>> the logs here would be "big data". Now I'm unclear whether there's > >>> anything > >>>> actually running or if I did something wrong, so I decide to kill the > >>>> process so I can set a smaller date range and get a better picture of > >>>> what's up. > >>>> > >>>> I check my logging level, am I in DEBUG? Nope. Just INFO. So I take a > >>> note > >>>> that I'll need to find that log-flooding line and demote it to DEBUG > in a > >>>> quick PR, no biggy. > >>>> > >>>> Now I restart with just a single schedule, and get an error `Dag > >>> {some_dag} > >>>> has reached maximum amount of 3 dag runs`. Hmmm, I wish backfill could > >>> just > >>>> pickup where it left off. Maybe I need to run an `airflow clear` > command > >>>> and restart? Ok, ran my clear command, same error is showing up. Dead > >>> end. > >>>> > >>>> Maybe there is some new `airflow clear --reset-dagruns` option? > Doesn't > >>>> look like it... Maybe `airflow backfill` has some new switches to > pick up > >>>> where it left off? Can't find it. Am I supposed to clear the DAG Runs > >>>> manually in the UI? This is a pre-production, in-development DAG, so > >>> it's > >>>> not on the production web server. Am I supposed to fire up my own web > >>>> server to go and manually handle the backfill-related DAG Runs? > Cannot to > >>>> my staging MySQL and do manually clear some DAG runs? > >>>> > >>>> So. Fire up a web server, navigate to my dag_id, delete the DAG runs, > it > >>>> appears I can finally start over. > >>>> > >>>> Next thought was: "Alright looks like I need to go Linus on the > mailing > >>>> list". > >>>> > >>>> What am I missing? I'm really hoping these issues specific to 1.8.2! > >>>> > >>>> Backfilling is core to Airflow and should work very well. I want to > >>> restate > >>>> some reqs for Airflow backfill: > >>>> * when failing / interrupted, it should seamlessly be able to pickup > >>> where > >>>> it left off > >>>> * terminal logging at the INFO level should be a clear, human > consumable, > >>>> indicator of progress > >>>> * backfill-related operations (including restarts) should be doable > >>> through > >>>> CLI interactions, and not require web server interactions as the > typical > >>>> sandbox (dev environment) shouldn't assume the existence of a web > server > >>>> > >>>> Let's fix this. > >>>> > >>>> Max > >>>> > >>> >