+1 on the backfill CLI command being a wrapper around submitting a job to
the REST API.

Since backfills run client-side as a CLI command, any transient problem on
the node running it kills the backfill, and it never restarts. When a
backfill dies overnight and you have to restart it in the morning, it is
super painful knowing you wasted a bunch of time.
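To make the proposal concrete, here is a minimal sketch of what such a wrapper could look like. The endpoint path, payload shape, and `backfill_id` field are all assumptions for illustration; no such REST API exists in Airflow at the time of writing.

```python
# Hypothetical sketch: a thin CLI wrapper that submits a backfill as a
# server-side job via a REST endpoint, instead of running it in the
# client process. The endpoint and payload are invented for illustration.
import json


def build_backfill_request(dag_id, start_date, end_date, rerun_failed=False):
    """Build the JSON payload the wrapper would POST to the API."""
    return {
        "dag_id": dag_id,
        "start_date": start_date,
        "end_date": end_date,
        "rerun_failed_tasks": rerun_failed,
    }


def submit_backfill(session, base_url, payload):
    """POST the job; a scheduler-side runner would own retries/restarts."""
    resp = session.post(base_url + "/api/experimental/backfill",
                        data=json.dumps(payload),
                        headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return resp.json()["backfill_id"]
```

With this split, the CLI process can die at any point after submission without killing the backfill itself.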



On Sun, Apr 14, 2019 at 1:38 PM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> Good points James,
>
> Personally, I never use CLI backfilling, and I recommend that colleagues
> avoid it because of the points that you mention. I too resort to the poor
> man's backfill (clearing the future and past in the UI).
>
> I'd rather get rid of the CLI, and would like to see the possibility to
> submit a backfill job through the REST API. In this case, it can be part of
> the web UI, but you could also write a CLI tool if that is your thing :-)
>
> Cheers, Fokko
>
> Op za 13 apr. 2019 om 23:26 schreef Maxime Beauchemin <
> maximebeauche...@gmail.com>:
>
> > +1, backfilling, and related "subdag surgeries" are core to a data
> > engineer's job, and great tooling around this is super important.
> Backfill
> > needs more TLC!
> >
> > Max
> >
> > On Fri, Apr 12, 2019 at 11:48 PM Chao-Han Tsai <milton0...@gmail.com>
> > wrote:
> >
> > > +1 on improving backfill.
> > >
> > > - The terminal interface was uselessly verbose. It was scrolling fast
> > > > enough to be unreadable.
> > >
> > >
> > > I agree that backfill is currently too verbose. It simply logs too
> > > many things, and it is hard to read. Often, I only care about the
> > > number of tasks/dagruns that are in progress, finished, or not
> > > started. I had a PR
> > > <https://github.com/apache/airflow/pull/3478> that implements a
> > > progress bar for backfill but was not able to finish it. It could
> > > probably help improve the backfill experience.
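As a rough illustration of the progress-summary idea (the state names mirror Airflow's task-instance states; the summary format itself is invented for this sketch, not taken from the linked PR):

```python
# Sketch: collapse per-task state into the single progress line a
# backfill could print per tick, instead of one log line per task.
from collections import Counter


def progress_line(task_states):
    """Summarize a list of task-instance state strings in one line."""
    counts = Counter(task_states)
    done = counts["success"] + counts["failed"] + counts["skipped"]
    return ("backfill progress: {done}/{total} done "
            "({running} running, {failed} failed)").format(
                done=done,
                total=len(task_states),
                running=counts["running"],
                failed=counts["failed"])
```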
> > >
> > > - The backfill exceeded safe concurrency limits for the cluster and
> > > > could've easily brought it down if I'd left it running.
> > >
> > >
> > > By the way, backfill now respects pool limits, but we should probably
> > > look into making it respect the concurrency limit as well.
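The check being asked for amounts to gating each task on two limits rather than one. A minimal sketch, assuming simple integer counters (this is not Airflow's actual scheduling code):

```python
# Illustrative sketch: before queuing another task instance, a backfill
# could respect both the pool's open slots and the DAG-level
# concurrency limit, rather than the pool alone.
def can_queue_task(open_pool_slots, running_in_dag, dag_concurrency):
    """True if both the pool and the DAG concurrency limit allow one more task."""
    return open_pool_slots > 0 and running_in_dag < dag_concurrency
```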
> > >
> > > Chao-Han
> > >
> > > >
> > > >
> > >
> > >
> > > On Mon, Mar 4, 2019 at 12:35 PM James Meickle
> > > <jmeic...@quantopian.com.invalid> wrote:
> > >
> > > > This is an old thread, but I wanted to bump it as I just had a
> > > > really bad experience using backfill. I'd been hesitant to even try
> > > > backfills, given what I've read about them, so I've just relied on
> > > > the UI to "Clear" entire tasks. However, I wanted to give it a shot
> > > > the "right" way. Issues I ran into:
> > > >
> > > > - The dry-run flag didn't give good feedback about which dagruns
> > > > and task instances would be affected (and is very easy to typo as
> > > > "--dry-run")
> > > >
> > > > - The terminal interface was uselessly verbose. It was scrolling fast
> > > > enough to be unreadable.
> > > >
> > > > - The backfill exceeded safe concurrency limits for the cluster and
> > > > could've easily brought it down if I'd left it running.
> > > >
> > > > - Tasks in the backfill were executed out of order despite the tasks
> > > having
> > > > `depends_on_past`
> > > >
> > > > - The backfill converted all existing DagRuns to backfill runs that
> > > > the scheduler later ignored, which is not how I would've expected
> > > > this to work (nor was it indicated in the dry run)
> > > >
> > > > I ended up having to do manual recovery work in the database to turn
> > the
> > > > "backfill" runs back into scheduler runs, and then switch to using
> > > `airflow
> > > > clear`. I'm a heavy Airflow user and this took me an hour; it
> would've
> > > been
> > > > much worse for anyone else on my team.
> > > >
> > > > I don't have any specific suggestions here other than to confirm that
> > > this
> > > > feature needs an overhaul if it's to be recommended to anyone.
> > > >
> > > > On Fri, Jun 8, 2018 at 5:38 PM Maxime Beauchemin <
> > > > maximebeauche...@gmail.com>
> > > > wrote:
> > > >
> > > > > Ash, I don't see how this could happen, unless the node doing the
> > > > > backfill is using a different metadata database.
> > > > >
> > > > > In general we recommend that people run --local backfills, and we
> > > > > have the default/sandbox template for `airflow.cfg` use a
> > > > > LocalExecutor with reasonable parallelism to make that behavior
> > > > > the default.
> > > > >
> > > > > Given the [not-so-great] state of backfill, I'm guessing many
> > > > > have been using the scheduler to do backfills. In that regard it
> > > > > would be nice to have CLI commands to generate dagruns or alter
> > > > > the state of existing ones.
> > > > >
> > > > > Max
> > > > >
> > > > > On Fri, Jun 8, 2018 at 8:56 AM Ash Berlin-Taylor <
> > > > > ash_airflowl...@firemirror.com> wrote:
> > > > >
> > > > > > Somewhat related to this, but likely a different issue:
> > > > > >
> > > > > > I've just had a case where a long (7hours) running backfill task
> > > ended
> > > > up
> > > > > > running twice somehow. We're using Celery so this might be
> related
> > to
> > > > > some
> > > > > > sort of Celery visibility timeout, but I haven't had a chance to
> be
> > > > able
> > > > > to
> > > > > > dig in to it in detail - it's 5pm on a Friday :D
> > > > > >
> > > > > > Has anyone else noticed anything similar?
> > > > > >
> > > > > > -ash
> > > > > >
> > > > > >
> > > > > > > On 8 Jun 2018, at 01:22, Tao Feng <fengta...@gmail.com> wrote:
> > > > > > >
> > > > > > > Thanks everyone for the feedback, especially on the
> > > > > > > background for backfill. After reading the discussion, I think
> > > > > > > it would be safest to add a flag to auto-rerun failed tasks
> > > > > > > for backfill, defaulting to false. I have updated the PR
> > > > > > > accordingly.
> > > > > > >
> > > > > > > Thanks a lot,
> > > > > > > -Tao
> > > > > > >
> > > > > > > On Wed, Jun 6, 2018 at 1:47 PM, Mark Whitfield <
> > > > > > mark.whitfi...@nytimes.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> I've been doing some work setting up a large, collaborative
> > > Airflow
> > > > > > >> pipeline with a group that makes heavy use of backfills, and
> > have
> > > > been
> > > > > > >> encountering a lot of these issues myself.
> > > > > > >>
> > > > > > >> Other gripes:
> > > > > > >>
> > > > > > >> Backfills do not obey concurrency pool restrictions. We had
> been
> > > > > making
> > > > > > >> heavy use of SubDAGs and using concurrency pools to prevent
> > > > deadlocks
> > > > > > (why
> > > > > > >> does the SubDAG itself even need to occupy a concurrency slot
> if
> > > > none
> > > > > of
> > > > > > >> its constituent tasks are running?), but this quickly became
> > > > untenable
> > > > > > when
> > > > > > >> using backfills and we were forced to mostly abandon SubDAGs.
> > > > > > >>
> > > > > > >> Backfills do use DagRuns now, which is a big improvement.
> > However,
> > > > > it's
> > > > > > a
> > > > > > >> common use case for us to add new tasks to a DAG and backfill
> > to a
> > > > > date
> > > > > > >> specific to that task. When we do this, the BackfillJob will
> > pick
> > > up
> > > > > > >> previous backfill DagRuns and re-use them, which is mostly
> nice
> > > > > because
> > > > > > it
> > > > > > >> keeps the Tree view neatly organized in the UI. However, it
> does
> > > not
> > > > > > reset
> > > > > > >> the start time of the DagRun when it does this. Combined with
> a
> > > > > > DAG-level
> > > > > > >> timeout, this means that the backfill job will activate a
> > DagRun,
> > > > but
> > > > > > then
> > > > > > >> the run will immediately time out (since it still thinks it's
> > been
> > > > > > running
> > > > > > >> since the previous backfill). This will cause tasks to
> deadlock
> > > > > > spuriously,
> > > > > > >> making backfills extremely cumbersome to carry out.
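A minimal sketch of the start-time reset implied above, in plain Python. The `dag_run` here stands for any object with `start_date` and `state` attributes, not Airflow's actual DagRun model, and this is not the real BackfillJob code:

```python
# Sketch: when a BackfillJob reuses an existing backfill DagRun, reset
# its start_date so a DAG-level timeout is measured from the current
# attempt, not from the previous backfill.
from datetime import datetime, timedelta


def has_timed_out(start_date, timeout, now=None):
    """True if a run started at start_date has exceeded its timeout."""
    now = now or datetime.utcnow()
    return now - start_date > timeout


def reuse_dag_run(dag_run, now=None):
    """Reset the clock on a reused backfill DagRun before restarting it."""
    dag_run.start_date = now or datetime.utcnow()
    dag_run.state = "running"
    return dag_run
```

Without the reset, `has_timed_out` is evaluated against the old start date and fires immediately, which matches the spurious deadlocks described above.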
> > > > > > >>
> > > > > > >> *Mark Whitfield*
> > > > > > >> Data Scientist
> > > > > > >> New York Times
> > > > > > >>
> > > > > > >>
> > > > > > >> On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <
> > > > > > >> maximebeauche...@gmail.com>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Thanks for the input, this is helpful.
> > > > > > >>>
> > > > > > >>> To add to the list, there's some complexity around
> concurrency
> > > > > > management
> > > > > > >>> and multiple executors:
> > > > > > >>> I just hit this thing where backfill doesn't check the
> > > > > > >>> DAG-level concurrency limit, fires up 32 tasks, and `airflow
> > > > > > >>> run` double-checks the DAG-level concurrency limit and
> > > > > > >>> exits. Right after, backfill reschedules them right away,
> > > > > > >>> and so on, burning a bunch of CPU doing nothing. In this
> > > > > > >>> specific case it seems like `airflow run` should skip that
> > > > > > >>> specific check when running in the context of a backfill.
> > > > > > >>>
> > > > > > >>> Max
> > > > > > >>>
> > > > > > >>> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <
> > bdbr...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >>>
> > > > > > >>>> Thinking out loud here, because it is a while back that I
> did
> > > work
> > > > > on
> > > > > > >>>> backfills. There were some real issues with backfills:
> > > > > > >>>>
> > > > > > >>>> 1. Tasks were running in a non-deterministic order, ending
> > > > > > >>>> up in regular deadlocks.
> > > > > > >>>> 2. Backfills didn't create dag runs, making behavior
> > > > > > >>>> inconsistent. Max dag runs could not be enforced, the UI
> > > > > > >>>> couldn't really display them, and there were lots of minor
> > > > > > >>>> other issues because of it.
> > > > > > >>>> 3. Behavior was different from the scheduler, while
> > > > > > >>>> subdagoperators in particular make use of backfills at the
> > > > > > >>>> moment.
> > > > > > >>>>
> > > > > > >>>> I think with 3 the behavior you are observing crept in. And
> > > given
> > > > 3
> > > > > I
> > > > > > >>>> would argue a consistent behavior between the scheduler and
> > the
> > > > > > >> backfill
> > > > > > >>>> mechanism is still paramount. Thus we should explicitly
> clear
> > > > tasks
> > > > > > >> from
> > > > > > >>>> failed if we want to rerun them. This at least until we move
> > the
> > > > > > >>>> subdagoperator out of backfill and into the scheduler (which
> > is
> > > > > > >> actually
> > > > > > >>>> not too hard). Also we need those command line options
> anyway.
> > > > > > >>>>
> > > > > > >>>> Bolke
> > > > > > >>>>
> > > > > > >>>> Verstuurd vanaf mijn iPad
> > > > > > >>>>
> > > > > > >>>>> Op 6 jun. 2018 om 01:27 heeft Scott Halgrim <
> > > > > > >> scott.halg...@zapier.com
> > > > > > >>> .INVALID>
> > > > > > >>>> het volgende geschreven:
> > > > > > >>>>>
> > > > > > >>>>> The request was for opposition, but I’d like to weigh in
> > > > > > >>>>> on the side of “it’s a better behavior to have failed
> > > > > > >>>>> tasks re-run when cleared in a backfill.”
> > > > > > >>>>>> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> > > > > > >>>> maximebeauche...@gmail.com>, wrote:
> > > > > > >>>>>> @Jeremiah Lowin <jlo...@gmail.com> & @Bolke de Bruin <
> > > > > > >>> bdbr...@gmail.com>
> > > > > > >>>> I
> > > > > > >>>>>> think you may have some context on why this may have
> changed
> > > at
> > > > > some
> > > > > > >>>> point.
> > > > > > >>>>>> I'm assuming that when DagRun handling was added to the
> > > backfill
> > > > > > >>> logic,
> > > > > > >>>> the
> > > > > > >>>>>> behavior just happened to change to what it is now.
> > > > > > >>>>>>
> > > > > > >>>>>> Any opposition in moving back towards re-running failed
> > tasks
> > > > when
> > > > > > >>>> starting
> > > > > > >>>>>> a backfill? I think it's a better behavior, though it's a
> > > change
> > > > > in
> > > > > > >>>>>> behavior that we should mention in UPDATE.md.
> > > > > > >>>>>>
> > > > > > >>>>>> One of our goals is to make sure that a failed or killed
> > > > backfill
> > > > > > >> can
> > > > > > >>> be
> > > > > > >>>>>> restarted and just seamlessly pick up where it left off.
> > > > > > >>>>>>
> > > > > > >>>>>> Max
> > > > > > >>>>>>
> > > > > > >>>>>>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <
> > fengta...@gmail.com
> > > >
> > > > > > >> wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>> After discussing with Max, we think it would be great
> > > > > > >>>>>>> if `airflow backfill` could automatically pick up and
> > > > > > >>>>>>> rerun failed tasks. Currently, it throws an exception (
> > > > > > >>>>>>> https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489
> > > > > > >>>>>>> ) without rerunning the failed tasks.
> > > > > > >>>>>>>
> > > > > > >>>>>>> But since it breaks some of the previous assumptions
> > > > > > >>>>>>> for backfill, we would like to get some feedback and see
> > > > > > >>>>>>> if anyone has any concerns (the PR can be found at
> > > > > > >>>>>>> https://github.com/apache/incubator-airflow/pull/3464/files
> > > > > > >>>>>>> ).
> > > > > > >>>>>>>
> > > > > > >>>>>>> Thanks,
> > > > > > >>>>>>> -Tao
> > > > > > >>>>>>>
> > > > > > >>>>>>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> > > > > > >>>>>>> maximebeauche...@gmail.com> wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>>> So I'm running a backfill, for what feels like the
> > > > > > >>>>>>>> first time in years, using a simple `airflow backfill
> > > > > > >>>>>>>> --local` command.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> First I start getting a ton of `logging.info` output,
> > > > > > >>>>>>>> one line per task that cannot be started just yet, at
> > > > > > >>>>>>>> every tick, flooding my terminal with the keyword
> > > > > > >>>>>>>> `FAILED`, looking like a million lines like this one:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO -
> > > Dependencies
> > > > > not
> > > > > > >>> met
> > > > > > >>>>>>> for
> > > > > > >>>>>>>> <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00
> > > > > > >>> [scheduled]>,
> > > > > > >>>>>>>> dependency 'Trigger Rule' FAILED: Task's trigger rule
> > > > > > >>>>>>>> 'all_success' requires all upstream tasks to have
> > > > > > >>>>>>>> succeeded, but found 1 non-success(es).
> > > > > > >>>>>>>> upstream_tasks_state={'successes': 0L, 'failed': 0L,
> > > > > > >>>>>>>> 'upstream_failed': 0L, 'skipped': 0L, 'done': 0L},
> > > > > > >>>>>>>> upstream_task_ids=['some_other_task_id']
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Good thing I triggered 1 month and not the 2 years I
> > > > > > >>>>>>>> actually need; just the logs here would be "big data".
> > > > > > >>>>>>>> Now I'm unclear whether there's anything actually
> > > > > > >>>>>>>> running or whether I did something wrong, so I decide
> > > > > > >>>>>>>> to kill the process so I can set a smaller date range
> > > > > > >>>>>>>> and get a better picture of what's up.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> I check my logging level: am I in DEBUG? Nope, just
> > > > > > >>>>>>>> INFO. So I take a note that I'll need to find that
> > > > > > >>>>>>>> log-flooding line and demote it to DEBUG in a quick PR,
> > > > > > >>>>>>>> no biggy.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Now I restart with just a single schedule, and get an
> > > > > > >>>>>>>> error: `Dag {some_dag} has reached maximum amount of 3
> > > > > > >>>>>>>> dag runs`. Hmmm, I wish backfill could just pick up
> > > > > > >>>>>>>> where it left off. Maybe I need to run an `airflow
> > > > > > >>>>>>>> clear` command and restart? Ok, ran my clear command;
> > > > > > >>>>>>>> the same error is showing up. Dead end.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Maybe there is some new `airflow clear
> > > > > > >>>>>>>> --reset-dagruns` option? Doesn't look like it... Maybe
> > > > > > >>>>>>>> `airflow backfill` has some new switches to pick up
> > > > > > >>>>>>>> where it left off? Can't find it. Am I supposed to
> > > > > > >>>>>>>> clear the DAG Runs manually in the UI? This is a
> > > > > > >>>>>>>> pre-production, in-development DAG, so it's not on the
> > > > > > >>>>>>>> production web server. Am I supposed to fire up my own
> > > > > > >>>>>>>> web server to go and manually handle the
> > > > > > >>>>>>>> backfill-related DAG Runs? Can't I connect to my
> > > > > > >>>>>>>> staging MySQL and manually clear some DAG runs?
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> So. Fire up a web server, navigate to my dag_id, delete
> > the
> > > > DAG
> > > > > > >>> runs,
> > > > > > >>>> it
> > > > > > >>>>>>>> appears I can finally start over.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Next thought was: "Alright looks like I need to go Linus
> > on
> > > > the
> > > > > > >>>> mailing
> > > > > > >>>>>>>> list".
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> What am I missing? I'm really hoping these issues are
> > > > > > >>>>>>>> specific to 1.8.2!
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Backfilling is core to Airflow and should work very
> > > > > > >>>>>>>> well. I want to restate some requirements for Airflow
> > > > > > >>>>>>>> backfill:
> > > > > > >>>>>>>> * when failing / interrupted, it should be able to
> > > > > > >>>>>>>> seamlessly pick up where it left off
> > > > > > >>>>>>>> * terminal logging at the INFO level should be a clear,
> > > > > > >>>>>>>> human-consumable indicator of progress
> > > > > > >>>>>>>> * backfill-related operations (including restarts)
> > > > > > >>>>>>>> should be doable through CLI interactions, and should
> > > > > > >>>>>>>> not require web server interactions, as the typical
> > > > > > >>>>>>>> sandbox (dev environment) shouldn't assume the
> > > > > > >>>>>>>> existence of a web server
> > > > > > >>>>>>>> Let's fix this.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Max
> > > > > > >>>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Chao-Han Tsai
> > >
> >
>
