Somewhat related to this, but likely a different issue: I've just had a case where a long-running (7 hours) backfill task ended up running twice somehow. We're using Celery, so this might be related to some sort of Celery visibility timeout, but I haven't had a chance to dig into it in detail - it's 5pm on a Friday :D
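If it does turn out to be the visibility timeout, the usual mitigation (for Redis and SQS brokers, per the Celery docs) is to raise it well above the longest task you expect to run. A rough, untested sketch of the Celery-side option; exactly where this plugs into your Airflow/Celery config depends on the version you're running:

    # Celery broker transport options (Redis/SQS brokers); value is in seconds.
    # 10h comfortably covers a 7h backfill task, so the broker shouldn't
    # redeliver a task that is still running on a worker.
    broker_transport_options = {'visibility_timeout': 10 * 3600}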
Has anyone else noticed anything similar?

-ash

On 8 Jun 2018, at 01:22, Tao Feng <fengta...@gmail.com> wrote:

> Thanks everyone for the feedback, especially on the background for backfill. After reading the discussion, I think it would be safest to add a flag to auto-rerun failed tasks for backfill, with the default set to false. I have updated the PR accordingly.
>
> Thanks a lot,
> -Tao
>
> On Wed, Jun 6, 2018 at 1:47 PM, Mark Whitfield <mark.whitfi...@nytimes.com> wrote:
>
>> I've been doing some work setting up a large, collaborative Airflow pipeline with a group that makes heavy use of backfills, and have been encountering a lot of these issues myself.
>>
>> Other gripes:
>>
>> Backfills do not obey concurrency pool restrictions. We had been making heavy use of SubDAGs and using concurrency pools to prevent deadlocks (why does the SubDAG itself even need to occupy a concurrency slot if none of its constituent tasks are running?), but this quickly became untenable when using backfills and we were forced to mostly abandon SubDAGs.
>>
>> Backfills do use DagRuns now, which is a big improvement. However, it's a common use case for us to add new tasks to a DAG and backfill to a date specific to that task. When we do this, the BackfillJob will pick up previous backfill DagRuns and re-use them, which is mostly nice because it keeps the Tree view neatly organized in the UI. However, it does not reset the start time of the DagRun when it does this. Combined with a DAG-level timeout, this means that the backfill job will activate a DagRun, but then the run will immediately time out (since it still thinks it's been running since the previous backfill). This will cause tasks to deadlock spuriously, making backfills extremely cumbersome to carry out.
>>
>> *Mark Whitfield*
>> Data Scientist
>> New York Times
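The DAG-level timeout Mark describes is presumably the `dagrun_timeout` argument on the DAG itself. A minimal, hypothetical sketch (1.x-era import paths, not anyone's production DAG) of the combination involved - per Mark's report, a re-used backfill DagRun that keeps its old start time blows straight past a timeout like this:

    # Hypothetical DAG, only to illustrate the interaction described above:
    # dagrun_timeout is measured against the DagRun's start time, so a re-used
    # backfill DagRun whose start time is never reset times out immediately.
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG(
        dag_id='example_backfill_timeout',
        start_date=datetime(2018, 1, 1),
        schedule_interval='@daily',
        dagrun_timeout=timedelta(hours=2),
    )

    noop = DummyOperator(task_id='noop', dag=dag)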
>> On Wed, Jun 6, 2018 at 3:33 PM, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
>>
>>> Thanks for the input, this is helpful.
>>>
>>> To add to the list, there's some complexity around concurrency management and multiple executors: I just hit this thing where backfill doesn't check DAG-level concurrency, fires up 32 tasks, and `airflow run` double-checks the DAG-level concurrency limit and exits. Right after, backfill reschedules them right away, and so on, burning a bunch of CPU doing nothing. In this specific case it seems like `airflow run` should skip that specific check when in the context of a backfill.
>>>
>>> Max
>>>
>>> On Tue, Jun 5, 2018 at 9:23 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>
>>>> Thinking out loud here, because it is a while back that I did work on backfills. There were some real issues with backfills:
>>>>
>>>> 1. Tasks were running in non-deterministic order, ending up in regular deadlocks.
>>>> 2. Backfills didn't create dag runs, making behavior inconsistent: max dag runs could not be enforced, the UI couldn't really display them, and there were lots of minor other issues because of it.
>>>> 3. Behavior was different from the scheduler, while subdagoperators particularly make use of backfills at the moment.
>>>>
>>>> I think with 3 the behavior you are observing crept in. And given 3, I would argue that consistent behavior between the scheduler and the backfill mechanism is still paramount. Thus we should explicitly clear tasks from failed if we want to rerun them. This at least until we move the subdagoperator out of backfill and into the scheduler (which is actually not too hard). Also, we need those command line options anyway.
>>>>
>>>> Bolke
>>>>
>>>> Sent from my iPad
>>>>
>>>> On 6 Jun 2018, at 01:27, Scott Halgrim <scott.halg...@zapier.com.INVALID> wrote:
>>>>
>>>>> The request was for opposition, but I'd like to weigh in on the side of "it's a better behavior [to have failed tasks re-run when cleared in a backfill]".
>>>>>
>>>>> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
>>>>>
>>>>>> @Jeremiah Lowin <jlo...@gmail.com> & @Bolke de Bruin <bdbr...@gmail.com>, I think you may have some context on why this may have changed at some point. I'm assuming that when DagRun handling was added to the backfill logic, the behavior just happened to change to what it is now.
>>>>>>
>>>>>> Any opposition to moving back towards re-running failed tasks when starting a backfill? I think it's a better behavior, though it's a change in behavior that we should mention in UPDATE.md.
>>>>>>
>>>>>> One of our goals is to make sure that a failed or killed backfill can be restarted and just seamlessly pick up where it left off.
>>>>>>
>>>>>> Max
>>>>>>
>>>>>> On Tue, Jun 5, 2018 at 3:25 PM, Tao Feng <fengta...@gmail.com> wrote:
>>>>>>
>>>>>>> After discussing with Max, we think it would be great if `airflow backfill` could auto pick up and rerun those failed tasks. Currently, it will throw exceptions (https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489) without rerunning the failed tasks.
>>>>>>>
>>>>>>> But since it breaks some of the previous assumptions for backfill, we would like to get some feedback and see if anyone has any concerns (the PR can be found at https://github.com/apache/incubator-airflow/pull/3464/files).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Tao
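For reference, the explicit-clear flow Bolke mentions maps onto the existing CLI roughly like this (flag names as of the 1.9-era CLI - double-check `airflow clear --help` on your version), although, as Max's report below shows, leftover backfill DagRuns can still get in the way:

    airflow clear -s 2018-01-01 -e 2018-01-31 --only_failed some_dag
    airflow backfill -s 2018-01-01 -e 2018-01-31 --local some_dag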
>>>>>>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
>>>>>>>
>>>>>>>> So I'm running a backfill for what feels like the first time in years, using a simple `airflow backfill --local` command.
>>>>>>>>
>>>>>>>> First I start getting a ton of `logging.info` output for each task that cannot be started just yet, at every tick, flooding my terminal with the keyword `FAILED`, looking like a million lines like this one:
>>>>>>>>
>>>>>>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies not met for <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00 [scheduled]>, dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' requires all upstream tasks to have succeeded, but found 1 non-success(es). upstream_tasks_state={'successes': 0L, 'failed': 0L, 'upstream_failed': 0L, 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other_task_id']
>>>>>>>>
>>>>>>>> Good thing I triggered 1 month and not 2 years like I actually need; just the logs here would be "big data". Now I'm unclear whether there's anything actually running or if I did something wrong, so I decide to kill the process so I can set a smaller date range and get a better picture of what's up.
>>>>>>>>
>>>>>>>> I check my logging level: am I in DEBUG? Nope. Just INFO. So I take a note that I'll need to find that log-flooding line and demote it to DEBUG in a quick PR, no biggy.
>>>>>>>>
>>>>>>>> Now I restart with just a single schedule, and get an error: `Dag {some_dag} has reached maximum amount of 3 dag runs`. Hmmm, I wish backfill could just pick up where it left off. Maybe I need to run an `airflow clear` command and restart? Ok, ran my clear command; the same error is showing up. Dead end.
>>>>>>>>
>>>>>>>> Maybe there is some new `airflow clear --reset-dagruns` option? Doesn't look like it... Maybe `airflow backfill` has some new switches to pick up where it left off? Can't find any. Am I supposed to clear the DAG Runs manually in the UI? This is a pre-production, in-development DAG, so it's not on the production web server. Am I supposed to fire up my own web server to go and manually handle the backfill-related DAG Runs? Connect to my staging MySQL and manually clear some DAG runs?
>>>>>>>>
>>>>>>>> So. Fire up a web server, navigate to my dag_id, delete the DAG runs, and it appears I can finally start over.
>>>>>>>>
>>>>>>>> Next thought was: "Alright, looks like I need to go Linus on the mailing list."
>>>>>>>>
>>>>>>>> What am I missing? I'm really hoping these issues are specific to 1.8.2!
>>>>>>>>
>>>>>>>> Backfilling is core to Airflow and should work very well. I want to restate some reqs for Airflow backfill:
>>>>>>>> * when failing / interrupted, it should seamlessly be able to pick up where it left off
>>>>>>>> * terminal logging at the INFO level should be a clear, human-consumable indicator of progress
>>>>>>>> * backfill-related operations (including restarts) should be doable through CLI interactions, and not require web server interactions, as the typical sandbox (dev environment) shouldn't assume the existence of a web server
>>>>>>>>
>>>>>>>> Let's fix this.
>>>>>>>>
>>>>>>>> Max
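Until backfill can genuinely pick up where it left off, the webserver-free equivalent of the "delete the DAG runs and start over" step above can be done straight against the metadata DB from a sandbox. A rough, untested sketch using the ORM - the dag id and date range are placeholders, and the model/import paths should be double-checked against your Airflow version:

    # Delete leftover, non-successful DagRuns for a date range so a backfill
    # can start over. Placeholder dag_id/dates; verify before running.
    from datetime import datetime

    from airflow import settings
    from airflow.models import DagRun
    from airflow.utils.state import State

    session = settings.Session()
    stale_runs = (
        session.query(DagRun)
        .filter(
            DagRun.dag_id == 'some_dag',
            DagRun.execution_date >= datetime(2018, 1, 1),
            DagRun.execution_date < datetime(2018, 2, 1),
            DagRun.state != State.SUCCESS,
        )
        .all()
    )
    for run in stale_runs:
        session.delete(run)
    session.commit()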