Note that sometimes it can be convenient to run a backfill based on a
previous version or altered DAG. For example, if logic has changed in the
repo but you need to re-run the earlier logic against some period in
2016, you may want to check out an earlier commit and trigger a backfill
based on that logic.

Another use case: if you're working on a brand new DAG, you may want to
run it on a month or three of history to plot / validate some data prior
to merging to master, ...

Assuming something like "DagFetcher", it'd be great for the backfill to be
run remotely, while allowing you to specify an alternate DAG artifact.

Max

On Sun, Apr 14, 2019 at 11:38 AM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> Good points James,
>
> Personally, I never use CLI backfilling, and I recommend colleagues
> avoid it because of the points you mention. I also resort to the
> poor man's backfilling (clearing the future and past in the UI).
>
> I'd rather get rid of the CLI, and would like to see the possibility to
> submit a backfill job through the REST API. In this case, it can be part of
> the web UI, but you could also write a CLI tool if that is your thing :-)
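For concreteness, here is a hypothetical sketch of what submitting a backfill through such a REST API could look like. The `/api/v1/backfills` endpoint, the payload shape, and the DAG id are all invented for illustration; no such endpoint existed at the time of this thread:

```python
import json
from urllib import request

# Hypothetical payload shape -- every field name here is an assumption.
payload = {
    "dag_id": "my_dag",          # hypothetical DAG id
    "start_date": "2016-01-01",
    "end_date": "2016-03-31",
    "rerun_failed_tasks": True,
}

def build_backfill_request(base_url: str) -> request.Request:
    """Build (but do not send) the POST that would submit the backfill."""
    return request.Request(
        url=f"{base_url}/api/v1/backfills",  # hypothetical endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_backfill_request("http://localhost:8080")
print(req.get_method(), req.full_url)
# prints: POST http://localhost:8080/api/v1/backfills
```

Sending it would just be `urllib.request.urlopen(req)`; a CLI wrapper or a web UI form could both sit on top of the same call.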
>
> Cheers, Fokko
>
> Op za 13 apr. 2019 om 23:26 schreef Maxime Beauchemin <
> maximebeauche...@gmail.com>:
>
> > +1, backfilling, and related "subdag surgeries" are core to a data
> > engineer's job, and great tooling around this is super important.
> Backfill
> > needs more TLC!
> >
> > Max
> >
> > On Fri, Apr 12, 2019 at 11:48 PM Chao-Han Tsai <milton0...@gmail.com>
> > wrote:
> >
> > > +1 on improving backfill.
> > >
> > > > - The terminal interface was uselessly verbose. It was scrolling fast
> > > > enough to be unreadable.
> > >
> > >
> > > I agree that backfill is currently too verbose. It simply logs too many
> > > things and is hard to read. Oftentimes, I only care about the number of
> > > tasks/dagruns that are in progress/finished/not started. I had a PR
> > > <https://github.com/apache/airflow/pull/3478> that implements a
> > > progress bar for backfill but was not able to finish it. That is
> > > probably something that can help improve the backfill experience.
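Purely as an illustration of that idea (the counts, names, and formatting below are invented, not taken from the PR), a single status line could replace the per-tick log spam:

```python
def backfill_status_line(finished: int, running: int, not_started: int,
                         failed: int = 0) -> str:
    """Render one compact progress line summarizing backfill state."""
    total = finished + running + not_started + failed
    done_pct = 100.0 * finished / total if total else 100.0
    bar_width = 20
    filled = int(bar_width * finished / total) if total else bar_width
    bar = "#" * filled + "-" * (bar_width - filled)
    return (f"[{bar}] {done_pct:5.1f}% | finished={finished} "
            f"running={running} not_started={not_started} failed={failed}")

# Example: 30 of 120 task instances finished.
print(backfill_status_line(finished=30, running=8, not_started=82))
```

Redrawing a line like this in place (e.g. with a carriage return) would make progress legible at a glance instead of scrolling past.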
> > >
> > > > - The backfill exceeded safe concurrency limits for the cluster and
> > > > could've easily brought it down if I'd left it running.
> > >
> > >
> > > Btw, backfill now respects pool limits, but we should probably look
> > > into making it respect the concurrency limit as well.
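As a sketch of what respecting that limit would mean (the names and structure here are mine, not Airflow's internals): the backfill loop should never have more task instances in flight than the DAG-level concurrency cap allows:

```python
from collections import deque

def dispatch(ready: deque, in_flight: set, dag_concurrency: int) -> list:
    """Move tasks from the ready queue into flight, capped by the
    DAG-level concurrency limit (a cap backfill historically ignored)."""
    started = []
    while ready and len(in_flight) < dag_concurrency:
        task = ready.popleft()
        in_flight.add(task)
        started.append(task)
    return started

ready = deque(f"task_{i}" for i in range(32))  # backfill wants 32 at once
in_flight = set()
started = dispatch(ready, in_flight, dag_concurrency=16)
print(len(started))  # prints: 16 -- the rest wait for slots to free up
```

As tasks finish and are removed from `in_flight`, repeated `dispatch` calls drain the queue without ever exceeding the cap.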
> > >
> > > Chao-Han
> > >
> > > On Mon, Mar 4, 2019 at 12:35 PM James Meickle
> > > <jmeic...@quantopian.com.invalid> wrote:
> > >
> > > > This is an old thread, but I wanted to bump it as I just had a really
> > > > bad experience using backfill. I'd been hesitant to even try
> > > > backfills given what I've read about them, so I've just relied on the
> > > > UI to "Clear" entire tasks. However, I wanted to give it a shot the
> > > > "right" way. Issues I ran into:
> > > >
> > > > - The dry run flag didn't give good feedback about which dagruns and
> > > > task instances would be affected (and is very easy to typo as
> > > > "--dry-run")
> > > >
> > > > - The terminal interface was uselessly verbose. It was scrolling fast
> > > > enough to be unreadable.
> > > >
> > > > - The backfill exceeded safe concurrency limits for the cluster and
> > > > could've easily brought it down if I'd left it running.
> > > >
> > > > - Tasks in the backfill were executed out of order despite the tasks
> > > > having `depends_on_past`
> > > >
> > > > - The backfill converted all existing DAGRuns to be backfill runs
> > > > that the scheduler later ignored, which is not how I would've
> > > > expected this to work (nor was it indicated in the dry run)
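On the `depends_on_past` point: a run for date d must not start until the run for d-1 has succeeded. A minimal model of the correct ordering (pure illustration, not Airflow's scheduler code):

```python
from datetime import date, timedelta

def runnable_dates(start: date, end: date, succeeded: set,
                   depends_on_past: bool) -> list:
    """Dates in [start, end] whose task may run now. With
    depends_on_past, only dates whose previous run already succeeded
    are eligible, which forces strictly chronological execution."""
    out = []
    d = start
    while d <= end:
        if d not in succeeded:
            prev_ok = (d == start) or (d - timedelta(days=1) in succeeded)
            if not depends_on_past or prev_ok:
                out.append(d)
        d += timedelta(days=1)
    return out

# Nothing has run yet: only the first date is eligible.
print(runnable_dates(date(2019, 3, 1), date(2019, 3, 4), set(), True))
# prints: [datetime.date(2019, 3, 1)]
```

The bug described above amounts to dispatching dates that this predicate would reject.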
> > > >
> > > > I ended up having to do manual recovery work in the database to turn
> > > > the "backfill" runs back into scheduler runs, and then switch to
> > > > using `airflow clear`. I'm a heavy Airflow user and this took me an
> > > > hour; it would've been much worse for anyone else on my team.
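For reference, the manual recovery described above boils down to rewriting rows in the metadata DB's `dag_run` table so the scheduler stops treating them as backfill runs. A sketch against an in-memory SQLite stand-in; the real schema is richer, and the `backfill_`/`scheduled__` run_id prefixes reflect the convention Airflow used around that era, so verify against your own database before touching anything:

```python
import sqlite3

# Stand-in for the Airflow metadata DB's dag_run table (simplified).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_run (dag_id TEXT, run_id TEXT)")
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?)",
    [("my_dag", "backfill_2019-03-01T00:00:00"),
     ("my_dag", "scheduled__2019-03-02T00:00:00")],
)

# Turn backfill runs back into scheduler-visible runs by renaming the
# run_id prefix (the scheduler skipped run_ids starting with 'backfill_').
conn.execute(
    "UPDATE dag_run "
    "SET run_id = 'scheduled__' || substr(run_id, length('backfill_') + 1) "
    "WHERE run_id LIKE 'backfill_%'"
)

print([r[0] for r in conn.execute("SELECT run_id FROM dag_run")])
```

Against a real MySQL/Postgres metadata DB the same `UPDATE` shape applies, but take a backup first; this is exactly the kind of surgery the thread argues the tooling should make unnecessary.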
> > > >
> > > > I don't have any specific suggestions here other than to confirm
> > > > that this feature needs an overhaul if it's to be recommended to
> > > > anyone.
> > > >
> > > > On Fri, Jun 8, 2018 at 5:38 PM Maxime Beauchemin <
> > > > maximebeauche...@gmail.com>
> > > > wrote:
> > > >
> > > > > Ash I don't see how this could happen unless maybe the node doing
> the
> > > > > backfill is using another metadata database.
> > > > >
> > > > > In general we recommend that people run --local backfills, and have
> > > > > the default/sandbox template for `airflow.cfg` use a LocalExecutor
> > > > > with reasonable parallelism to make that behavior the default.
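A sketch of what that sandbox `airflow.cfg` could contain (the section and key names match Airflow 1.x configuration; the numbers are arbitrary examples, not recommendations):

```ini
[core]
executor = LocalExecutor
# Global cap on concurrently running task instances.
parallelism = 8
# Per-DAG cap, so a backfill can't starve everything else.
dag_concurrency = 4
max_active_runs_per_dag = 3
```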
> > > > >
> > > > > Given the [not-so-great] state of backfill, I'm guessing many have
> > > > > been using the scheduler to do backfills. In that regard, it would
> > > > > be nice to have CLI commands to generate dagruns or alter the state
> > > > > of existing ones.
> > > > >
> > > > > Max
> > > > >
> > > > > On Fri, Jun 8, 2018 at 8:56 AM Ash Berlin-Taylor <
> > > > > ash_airflowl...@firemirror.com> wrote:
> > > > >
> > > > > > Somewhat related to this, but likely a different issue:
> > > > > >
> > > > > > I've just had a case where a long-running (7 hours) backfill task
> > > > > > ended up running twice somehow. We're using Celery so this might
> > > > > > be related to some sort of Celery visibility timeout, but I
> > > > > > haven't had a chance to dig into it in detail - it's 5pm on a
> > > > > > Friday :D
> > > > > >
> > > > > > Has anyone else noticed anything similar?
> > > > > >
> > > > > > -ash
> > > > > >
> > > > > >
> > > > > > > On 8 Jun 2018, at 01:22, Tao Feng <fengta...@gmail.com> wrote:
> > > > > > >
> > > > > > > Thanks everyone for the feedback, especially on the background
> > > > > > > for backfill. After reading the discussion, I think it would be
> > > > > > > safest to add a flag that auto-reruns failed tasks for backfill,
> > > > > > > defaulting to false. I have updated the PR accordingly.
> > > > > > >
> > > > > > > Thanks a lot,
> > > > > > > -Tao
> > > > > > >
> > > > > > > On Wed, Jun 6, 2018 at 1:47 PM, Mark Whitfield <
> > > > > > mark.whitfi...@nytimes.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> I've been doing some work setting up a large, collaborative
> > > Airflow
> > > > > > >> pipeline with a group that makes heavy use of backfills, and
> > have
> > > > been
> > > > > > >> encountering a lot of these issues myself.
> > > > > > >>
> > > > > > >> Other gripes:
> > > > > > >>
> > > > > > >> Backfills do not obey concurrency pool restrictions. We had
> > > > > > >> been making heavy use of SubDAGs and using concurrency pools to
> > > > > > >> prevent deadlocks (why does the SubDAG itself even need to
> > > > > > >> occupy a concurrency slot if none of its constituent tasks are
> > > > > > >> running?), but this quickly became untenable when using
> > > > > > >> backfills and we were forced to mostly abandon SubDAGs.
> > > > > > >>
> > > > > > >> Backfills do use DagRuns now, which is a big improvement.
> > > > > > >> However, it's a common use case for us to add new tasks to a
> > > > > > >> DAG and backfill to a date specific to that task. When we do
> > > > > > >> this, the BackfillJob will pick up previous backfill DagRuns
> > > > > > >> and re-use them, which is mostly nice because it keeps the Tree
> > > > > > >> view neatly organized in the UI. However, it does not reset the
> > > > > > >> start time of the DagRun when it does this. Combined with a
> > > > > > >> DAG-level timeout, this means that the backfill job will
> > > > > > >> activate a DagRun, but then the run will immediately time out
> > > > > > >> (since it still thinks it's been running since the previous
> > > > > > >> backfill). This will cause tasks to deadlock spuriously, making
> > > > > > >> backfills extremely cumbersome to carry out.
> > > > > > >>
> > > > > > >> *Mark Whitfield*
> > > > > > >> Data Scientist
> > > > > > >> New York Times
> > > > > > >>
> > > > > > >>
> > > > > > >> On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <
> > > > > > >> maximebeauche...@gmail.com>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Thanks for the input, this is helpful.
> > > > > > >>>
> > > > > > >>> To add to the list, there's some complexity around concurrency
> > > > > > >>> management and multiple executors:
> > > > > > >>> I just hit this thing where backfill doesn't check DAG-level
> > > > > > >>> concurrency, fires up 32 tasks, and `airflow run` double-checks
> > > > > > >>> the DAG-level concurrency limit and exits. Right after,
> > > > > > >>> backfill reschedules and so on, burning a bunch of CPU doing
> > > > > > >>> nothing. In this specific case it seems like `airflow run`
> > > > > > >>> should skip that specific check when in the context of a
> > > > > > >>> backfill.
> > > > > > >>>
> > > > > > >>> Max
> > > > > > >>>
> > > > > > >>> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <
> > bdbr...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >>>
> > > > > > >>>> Thinking out loud here, because it has been a while since I
> > > > > > >>>> worked on backfills. There were some real issues with
> > > > > > >>>> backfills:
> > > > > > >>>>
> > > > > > >>>> 1. Tasks were running in non-deterministic order, ending up
> > > > > > >>>> in regular deadlocks.
> > > > > > >>>> 2. They didn't create dag runs, making behavior inconsistent.
> > > > > > >>>> Max dag runs could not be enforced, the UI couldn't really
> > > > > > >>>> display them, and lots of minor other issues followed from
> > > > > > >>>> it.
> > > > > > >>>> 3. Behavior was different from the scheduler, while
> > > > > > >>>> subdagoperators particularly make use of backfills at the
> > > > > > >>>> moment.
> > > > > > >>>>
> > > > > > >>>> I think with 3 the behavior you are observing crept in. And
> > > > > > >>>> given 3, I would argue that consistent behavior between the
> > > > > > >>>> scheduler and the backfill mechanism is still paramount. Thus
> > > > > > >>>> we should explicitly clear tasks from failed if we want to
> > > > > > >>>> rerun them, at least until we move the subdagoperator out of
> > > > > > >>>> backfill and into the scheduler (which is actually not too
> > > > > > >>>> hard). Also, we need those command line options anyway.
> > > > > > >>>>
> > > > > > >>>> Bolke
> > > > > > >>>>
> > > > > > >>>> Verstuurd vanaf mijn iPad
> > > > > > >>>>
> > > > > > >>>>> Op 6 jun. 2018 om 01:27 heeft Scott Halgrim <
> > > > > > >> scott.halg...@zapier.com
> > > > > > >>> .INVALID>
> > > > > > >>>> het volgende geschreven:
> > > > > > >>>>>
> > > > > > >>>>> The request was for opposition, but I'd like to weigh in on
> > > > > > >>>>> the side of "it's a better behavior [to have failed tasks
> > > > > > >>>>> re-run when cleared in a backfill]".
> > > > > > >>>>>> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> > > > > > >>>> maximebeauche...@gmail.com>, wrote:
> > > > > > >>>>>> @Jeremiah Lowin <jlo...@gmail.com> & @Bolke de Bruin <
> > > > > > >>> bdbr...@gmail.com>
> > > > > > >>>> I
> > > > > > >>>>>> think you may have some context on why this may have
> changed
> > > at
> > > > > some
> > > > > > >>>> point.
> > > > > > >>>>>> I'm assuming that when DagRun handling was added to the
> > > backfill
> > > > > > >>> logic,
> > > > > > >>>> the
> > > > > > >>>>>> behavior just happened to change to what it is now.
> > > > > > >>>>>>
> > > > > > >>>>>> Any opposition in moving back towards re-running failed
> > tasks
> > > > when
> > > > > > >>>> starting
> > > > > > >>>>>> a backfill? I think it's a better behavior, though it's a
> > > change
> > > > > in
> > > > > > >>>>>> behavior that we should mention in UPDATE.md.
> > > > > > >>>>>>
> > > > > > >>>>>> One of our goals is to make sure that a failed or killed
> > > > backfill
> > > > > > >> can
> > > > > > >>> be
> > > > > > >>>>>> restarted and just seamlessly pick up where it left off.
> > > > > > >>>>>>
> > > > > > >>>>>> Max
> > > > > > >>>>>>
> > > > > > >>>>>>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <
> > fengta...@gmail.com
> > > >
> > > > > > >> wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>> After discussing with Max, we think it would be great if
> > > > > > >>>>>>> `airflow backfill` could automatically pick up and rerun
> > > > > > >>>>>>> those failed tasks. Currently, it will throw exceptions (
> > > > > > >>>>>>> https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489
> > > > > > >>>>>>> )
> > > > > > >>>>>>> without rerunning the failed tasks.
> > > > > > >>>>>>>
> > > > > > >>>>>>> But since it broke some of the previous assumptions for
> > > > > > >>>>>>> backfill, we would like to get some feedback and see if
> > > > > > >>>>>>> anyone has any concerns (the PR can be found at
> > > > > > >>>>>>> https://github.com/apache/incubator-airflow/pull/3464/files
> > > > > > >>>>>>> ).
> > > > > > >>>>>>>
> > > > > > >>>>>>> Thanks,
> > > > > > >>>>>>> -Tao
> > > > > > >>>>>>>
> > > > > > >>>>>>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> > > > > > >>>>>>> maximebeauche...@gmail.com> wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>>> So I'm running a backfill for what feels like the first
> > > > > > >>>>>>>> time in years, using a simple `airflow backfill --local`
> > > > > > >>>>>>>> command.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> First I start getting a ton of `logging.info` output, one
> > > > > > >>>>>>>> line for each task that cannot be started just yet at
> > > > > > >>>>>>>> every tick, flooding my terminal with the keyword `FAILED`
> > > > > > >>>>>>>> and looking like a million lines like this one:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO -
> > > > > > >>>>>>>> Dependencies not met for <TaskInstance:
> > > > > > >>>>>>>> some_dag.some_task_id 2018-01-28 00:00:00 [scheduled]>,
> > > > > > >>>>>>>> dependency 'Trigger Rule' FAILED: Task's trigger rule
> > > > > > >>>>>>>> 'all_success' requires all upstream tasks to have
> > > > > > >>>>>>>> succeeded, but found 1 non-success(es).
> > > > > > >>>>>>>> upstream_tasks_state={'successes': 0L, 'failed': 0L,
> > > > > > >>>>>>>> 'upstream_failed': 0L, 'skipped': 0L, 'done': 0L},
> > > > > > >>>>>>>> upstream_task_ids=['some_other_task_id']
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Good thing I triggered 1 month and not the 2 years I
> > > > > > >>>>>>>> actually need; the logs alone would be "big data". Now I'm
> > > > > > >>>>>>>> unclear whether there's anything actually running or if I
> > > > > > >>>>>>>> did something wrong, so I decide to kill the process so I
> > > > > > >>>>>>>> can set a smaller date range and get a better picture of
> > > > > > >>>>>>>> what's up.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> I check my logging level. Am I in DEBUG? Nope, just INFO.
> > > > > > >>>>>>>> So I take a note that I'll need to find that log-flooding
> > > > > > >>>>>>>> line and demote it to DEBUG in a quick PR, no biggy.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Now I restart with just a single schedule, and get an
> > > > > > >>>>>>>> error: `Dag {some_dag} has reached maximum amount of 3 dag
> > > > > > >>>>>>>> runs`. Hmmm, I wish backfill could just pick up where it
> > > > > > >>>>>>>> left off. Maybe I need to run an `airflow clear` command
> > > > > > >>>>>>>> and restart? Ok, ran my clear command, and the same error
> > > > > > >>>>>>>> is showing up. Dead end.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Maybe there is some new `airflow clear --reset-dagruns`
> > > > > > >>>>>>>> option? Doesn't look like it... Maybe `airflow backfill`
> > > > > > >>>>>>>> has some new switches to pick up where it left off? Can't
> > > > > > >>>>>>>> find any. Am I supposed to clear the DAG Runs manually in
> > > > > > >>>>>>>> the UI? This is a pre-production, in-development DAG, so
> > > > > > >>>>>>>> it's not on the production web server. Am I supposed to
> > > > > > >>>>>>>> fire up my own web server to go and manually handle the
> > > > > > >>>>>>>> backfill-related DAG Runs? Connect to my staging MySQL and
> > > > > > >>>>>>>> manually clear some DAG runs?
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> So: fire up a web server, navigate to my dag_id, delete
> > > > > > >>>>>>>> the DAG runs, and it appears I can finally start over.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Next thought was: "Alright, looks like I need to go Linus
> > > > > > >>>>>>>> on the mailing list".
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> What am I missing? I'm really hoping these issues are
> > > > > > >>>>>>>> specific to 1.8.2!
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Backfilling is core to Airflow and should work very well.
> > > > > > >>>>>>>> I want to restate some requirements for Airflow backfill:
> > > > > > >>>>>>>> * when failing / interrupted, it should seamlessly be able
> > > > > > >>>>>>>> to pick up where it left off
> > > > > > >>>>>>>> * terminal logging at the INFO level should be a clear,
> > > > > > >>>>>>>> human-consumable indicator of progress
> > > > > > >>>>>>>> * backfill-related operations (including restarts) should
> > > > > > >>>>>>>> be doable through CLI interactions, and not require web
> > > > > > >>>>>>>> server interactions, as the typical sandbox (dev
> > > > > > >>>>>>>> environment) shouldn't assume the existence of a web
> > > > > > >>>>>>>> server
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Let's fix this.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Max
> > > > > > >>>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Chao-Han Tsai
> > >
> >
>
