+1 on improving backfill.

> - The terminal interface was uselessly verbose. It was scrolling fast
> enough to be unreadable.


I agree that backfill is currently too verbose. It simply logs too many
things and is hard to read. Oftentimes, I only care about the number of
tasks/dagruns that are in progress/finished/not started. I had a PR
<https://github.com/apache/airflow/pull/3478> that implements a progress
bar for backfill but was not able to finish it. That is probably something
that could help improve the backfill experience.

> - The backfill exceeded safe concurrency limits for the cluster and
> could've easily brought it down if I'd left it running.


Btw, backfill now respects pool limits, but we should probably look into
making it respect the DAG concurrency limit as well.
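
The check I mean could look something like this, a minimal sketch assuming we
cap candidate task instances by the DAG-level concurrency limit before
releasing them (names here are illustrative, not Airflow internals):

```python
# Illustrative sketch of a DAG-level concurrency guard for backfill
# (hypothetical names; not the actual Airflow implementation).
def ready_to_run(candidates, running, dag_concurrency_limit):
    """Release only as many candidate task instances as free slots allow.

    candidates: task instances whose dependencies are already met
    running: number of task instances currently running for this DAG
    """
    free_slots = max(0, dag_concurrency_limit - running)
    return candidates[:free_slots]

# 14 of 16 slots are busy, so only the first two candidates are released
print(ready_to_run(["t1", "t2", "t3"], running=14, dag_concurrency_limit=16))
```

This is the same shape as the existing pool check, just keyed on the DAG's
concurrency setting instead of pool slots.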

Chao-Han


On Mon, Mar 4, 2019 at 12:35 PM James Meickle
<jmeic...@quantopian.com.invalid> wrote:

> This is an old thread, but I wanted to bump it as I just had a really bad
> experience using backfill. I'd been hesitant to even try backfills out
> given what I've read about it, so I've just relied on the UI to "Clear"
> entire tasks. However, I wanted to give it a shot the "right" way. Issues I
> ran into:
>
> - The dry run flag didn't give good feedback about which dagruns and task
> instances will be affected (and is very easy to typo as "--dry-run")
>
> - The terminal interface was uselessly verbose. It was scrolling fast
> enough to be unreadable.
>
> - The backfill exceeded safe concurrency limits for the cluster and
> could've easily brought it down if I'd left it running.
>
> - Tasks in the backfill were executed out of order despite the tasks having
> `depends_on_past`
>
> - The backfill converted all existing DAGRuns to be backfill runs that the
> scheduler later ignored, which is not how I would've expected this to work
> (nor was it indicated in the dry run)
>
> I ended up having to do manual recovery work in the database to turn the
> "backfill" runs back into scheduler runs, and then switch to using `airflow
> clear`. I'm a heavy Airflow user and this took me an hour; it would've been
> much worse for anyone else on my team.
>
> I don't have any specific suggestions here other than to confirm that this
> feature needs an overhaul if it's to be recommended to anyone.
>
> On Fri, Jun 8, 2018 at 5:38 PM Maxime Beauchemin <
> maximebeauche...@gmail.com>
> wrote:
>
> > Ash, I don't see how this could happen unless maybe the node doing the
> > backfill is using another metadata database.
> >
> > In general we recommend for people to run --local backfills and have the
> > default/sandbox template for `airflow.cfg` use a LocalExecutor with
> > reasonable parallelism to make that behavior the default.
> >
> > Given the [not-so-great] state of backfill, I'm guessing many have been
> > using the scheduler to do backfills. In that regard it would be nice to
> > have CLI commands to generate dagruns or alter the state of existing ones.
> >
> > Max
> >
> > On Fri, Jun 8, 2018 at 8:56 AM Ash Berlin-Taylor <
> > ash_airflowl...@firemirror.com> wrote:
> >
> > > Somewhat related to this, but likely a different issue:
> > >
> > > I've just had a case where a long (7hours) running backfill task ended
> up
> > > running twice somehow. We're using Celery so this might be related to
> > some
> > > sort of Celery visibility timeout, but I haven't had a chance to be
> able
> > to
> > > dig in to it in detail - it's 5pm on a Friday :D
> > >
> > > Has anyone else noticed anything similar?
> > >
> > > -ash
> > >
> > >
> > > > On 8 Jun 2018, at 01:22, Tao Feng <fengta...@gmail.com> wrote:
> > > >
> > > > Thanks everyone for the feedback especially on the background for
> > > backfill.
> > > > After reading the discussion, I think it would be safest to add a
> flag
> > > for
> > > > auto rerun failed tasks for backfill with default to be false. I have
> > > > updated the pr accordingly.
> > > >
> > > > Thanks a lot,
> > > > -Tao
> > > >
> > > > On Wed, Jun 6, 2018 at 1:47 PM, Mark Whitfield <
> > > mark.whitfi...@nytimes.com>
> > > > wrote:
> > > >
> > > >> I've been doing some work setting up a large, collaborative Airflow
> > > >> pipeline with a group that makes heavy use of backfills, and have
> been
> > > >> encountering a lot of these issues myself.
> > > >>
> > > >> Other gripes:
> > > >>
> > > >> Backfills do not obey concurrency pool restrictions. We had been
> > making
> > > >> heavy use of SubDAGs and using concurrency pools to prevent
> deadlocks
> > > (why
> > > >> does the SubDAG itself even need to occupy a concurrency slot if
> none
> > of
> > > >> its constituent tasks are running?), but this quickly became
> untenable
> > > when
> > > >> using backfills and we were forced to mostly abandon SubDAGs.
> > > >>
> > > >> Backfills do use DagRuns now, which is a big improvement. However,
> > it's
> > > a
> > > >> common use case for us to add new tasks to a DAG and backfill to a
> > date
> > > >> specific to that task. When we do this, the BackfillJob will pick up
> > > >> previous backfill DagRuns and re-use them, which is mostly nice
> > because
> > > it
> > > >> keeps the Tree view neatly organized in the UI. However, it does not
> > > reset
> > > >> the start time of the DagRun when it does this. Combined with a
> > > DAG-level
> > > >> timeout, this means that the backfill job will activate a DagRun,
> but
> > > then
> > > >> the run will immediately time out (since it still thinks it's been
> > > running
> > > >> since the previous backfill). This will cause tasks to deadlock
> > > spuriously,
> > > >> making backfills extremely cumbersome to carry out.
> > > >>
> > > >> *Mark Whitfield*
> > > >> Data Scientist
> > > >> New York Times
> > > >>
> > > >>
> > > >> On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <
> > > >> maximebeauche...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Thanks for the input, this is helpful.
> > > >>>
> > > >>> To add to the list, there's some complexity around concurrency
> > > management
> > > >>> and multiple executors:
> > > >>> I just hit this thing where backfill doesn't check DAG-level
> > > concurrency,
> > > >>> fires up 32 tasks, and `airflow run` double-checks DAG-level
> > > concurrency
> > > >>> limit and exits. Right after, backfill reschedules right away, and so on,
> > > >>> burning a bunch of CPU doing nothing. In this specific case it
> seems
> > > like
> > > >>> `airflow run` should skip that specific check when in the context
> of
> > a
> > > >>> backfill.
> > > >>>
> > > >>> Max
> > > >>>
> > > >>> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <bdbr...@gmail.com>
> > > wrote:
> > > >>>
> > > >>>> Thinking out loud here, because it is a while back that I did work
> > on
> > > >>>> backfills. There were some real issues with backfills:
> > > >>>>
> > > >>>> 1. Tasks were running in non-deterministic order, ending up in
> > > >>>> regular deadlocks.
> > > >>>> 2. Didn’t create DAG runs, making behavior inconsistent. Max DAG runs
> > > >>>> could not be enforced, the UI couldn’t really display it, and lots of
> > > >>>> minor other issues followed because of it.
> > > >>>> 3. Behavior was different from the scheduler, while
> subdagoperators
> > > >>>> particularly make use of backfills at the moment.
> > > >>>>
> > > >>>> I think with 3 the behavior you are observing crept in. And given
> 3
> > I
> > > >>>> would argue a consistent behavior between the scheduler and the
> > > >> backfill
> > > >>>> mechanism is still paramount. Thus we should explicitly clear
> tasks
> > > >> from
> > > >>>> failed if we want to rerun them. This at least until we move the
> > > >>>> subdagoperator out of backfill and into the scheduler (which is
> > > >> actually
> > > >>>> not too hard). Also we need those command line options anyway.
> > > >>>>
> > > >>>> Bolke
> > > >>>>
> > > >>>> Sent from my iPad
> > > >>>>
> > > >>>>> Op 6 jun. 2018 om 01:27 heeft Scott Halgrim <
> > > >> scott.halg...@zapier.com
> > > >>> .INVALID>
> > > >>>> het volgende geschreven:
> > > >>>>>
> > > >>>>> The request was for opposition, but I’d like to weigh in on the side
> > > >>>>> of “it’s a better behavior [to have failed tasks re-run when cleared
> > > >>>>> in a backfill]”.
> > > >>>>>> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> > > >>>> maximebeauche...@gmail.com>, wrote:
> > > >>>>>> @Jeremiah Lowin <jlo...@gmail.com> & @Bolke de Bruin <
> > > >>> bdbr...@gmail.com>
> > > >>>> I
> > > >>>>>> think you may have some context on why this may have changed at
> > some
> > > >>>> point.
> > > >>>>>> I'm assuming that when DagRun handling was added to the backfill
> > > >>> logic,
> > > >>>> the
> > > >>>>>> behavior just happened to change to what it is now.
> > > >>>>>>
> > > >>>>>> Any opposition in moving back towards re-running failed tasks
> when
> > > >>>> starting
> > > >>>>>> a backfill? I think it's a better behavior, though it's a change
> > in
> > > >>>>>> behavior that we should mention in UPDATE.md.
> > > >>>>>>
> > > >>>>>> One of our goals is to make sure that a failed or killed
> backfill
> > > >> can
> > > >>> be
> > > >>>>>> restarted and just seamlessly pick up where it left off.
> > > >>>>>>
> > > >>>>>> Max
> > > >>>>>>
> > > >>>>>>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <fengta...@gmail.com>
> > > >> wrote:
> > > >>>>>>>
> > > >>>>>>> After discussing with Max, we think it would be great if
> `airflow
> > > >>>> backfill`
> > > >>>>>>> could be able to auto pick up and rerun those failed tasks.
> > > >>> Currently,
> > > >>>> it
> > > >>>>>>> will throw exceptions(
> > > >>>>>>> https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489
> > > >>>>>>> )
> > > >>>>>>> without rerunning the failed tasks.
> > > >>>>>>>
> > > >>>>>>> But since it broke some of the previous assumptions for
> backfill,
> > > >> we
> > > >>>> would
> > > >>>>>>> like to get some feedback and see if anyone has any concerns(pr
> > > >> could
> > > >>>> be
> > > >>>>>>> found at https://github.com/apache/incubator-airflow/pull/3464/files).
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>> -Tao
> > > >>>>>>>
> > > >>>>>>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> > > >>>>>>> maximebeauche...@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> So I'm running a backfill for what feels like the first time
> in
> > > >>> years
> > > >>>>>>> using
> > > >>>>>>>> a simple `airflow backfill --local` commands.
> > > >>>>>>>>
> > > >>>>>>>> First I start getting a ton of `logging.info` lines for each task
> > that
> > > >>>> cannot
> > > >>>>>>> be
> > > >>>>>>>> started just yet at every tick flooding my terminal with the
> > > >> keyword
> > > >>>>>>>> `FAILED` in it, looking like a million of lines like this one:
> > > >>>>>>>>
> > > >>>>>>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies
> > > >>>>>>>> not met for <TaskInstance: some_dag.some_task_id 2018-01-28
> > > >>>>>>>> 00:00:00 [scheduled]>, dependency 'Trigger Rule' FAILED: Task's
> > > >>>>>>>> trigger rule 'all_success' requires all upstream tasks to have
> > > >>>>>>>> succeeded, but found 1 non-success(es).
> > > >>>>>>>> upstream_tasks_state={'successes': 0L, 'failed': 0L,
> > > >>>>>>>> 'upstream_failed': 0L, 'skipped': 0L, 'done': 0L},
> > > >>>>>>>> upstream_task_ids=['some_other_task_id']
> > > >>>>>>>>
> > > >>>>>>>> Good thing I triggered 1 month and not 2 years like I actually
> > > >> need,
> > > >>>> just
> > > >>>>>>>> the logs here would be "big data". Now I'm unclear whether
> > there's
> > > >>>>>>> anything
> > > >>>>>>>> actually running or if I did something wrong, so I decide to
> > kill
> > > >>> the
> > > >>>>>>>> process so I can set a smaller date range and get a better
> > picture
> > > >>> of
> > > >>>>>>>> what's up.
> > > >>>>>>>>
> > > >>>>>>>> I check my logging level, am I in DEBUG? Nope. Just INFO. So I
> > > >> take
> > > >>> a
> > > >>>>>>> note
> > > >>>>>>>> that I'll need to find that log-flooding line and demote it to
> > > >> DEBUG
> > > >>>> in a
> > > >>>>>>>> quick PR, no biggy.
> > > >>>>>>>>
> > > >>>>>>>> Now I restart with just a single schedule, and get an error
> `Dag
> > > >>>>>>> {some_dag}
> > > >>>>>>>> has reached maximum amount of 3 dag runs`. Hmmm, I wish
> backfill
> > > >>> could
> > > >>>>>>> just
> > > >>>>>>>> pickup where it left off. Maybe I need to run an `airflow
> clear`
> > > >>>> command
> > > >>>>>>>> and restart? Ok, ran my clear command, same error is showing
> up.
> > > >>> Dead
> > > >>>>>>> end.
> > > >>>>>>>>
> > > >>>>>>>> Maybe there is some new `airflow clear --reset-dagruns`
> option?
> > > >>>> Doesn't
> > > >>>>>>>> look like it... Maybe `airflow backfill` has some new switches
> > to
> > > >>>> pick up
> > > >>>>>>>> where it left off? Can't find it. Am I supposed to clear the
> DAG
> > > >>> Runs
> > > >>>>>>>> manually in the UI? This is a pre-production, in-development
> > DAG,
> > > >> so
> > > >>>>>>> it's
> > > >>>>>>>> not on the production web server. Am I supposed to fire up my
> > own
> > > >>> web
> > > >>>>>>>> server to go and manually handle the backfill-related DAG Runs?
> > > >>>>>>>> Or connect to my staging MySQL and manually clear some DAG runs?
> > > >>>>>>>>
> > > >>>>>>>> So. Fire up a web server, navigate to my dag_id, delete the
> DAG
> > > >>> runs,
> > > >>>> it
> > > >>>>>>>> appears I can finally start over.
> > > >>>>>>>>
> > > >>>>>>>> Next thought was: "Alright looks like I need to go Linus on
> the
> > > >>>> mailing
> > > >>>>>>>> list".
> > > >>>>>>>>
> > > >>>>>>>> What am I missing? I'm really hoping these issues are specific to
> > > >>>>>>>> 1.8.2!
> > > >>>>>>>>
> > > >>>>>>>> Backfilling is core to Airflow and should work very well. I
> want
> > > >> to
> > > >>>>>>> restate
> > > >>>>>>>> some reqs for Airflow backfill:
> > > >>>>>>>> * when failing / interrupted, it should seamlessly be able to
> > > >> pickup
> > > >>>>>>> where
> > > >>>>>>>> it left off
> > > >>>>>>>> * terminal logging at the INFO level should be a clear, human
> > > >>>> consumable,
> > > >>>>>>>> indicator of progress
> > > >>>>>>>> * backfill-related operations (including restarts) should be
> > > >> doable
> > > >>>>>>> through
> > > >>>>>>>> CLI interactions, and not require web server interactions as
> the
> > > >>>> typical
> > > >>>>>>>> sandbox (dev environment) shouldn't assume the existence of a
> > web
> > > >>>> server
> > > >>>>>>>>
> > > >>>>>>>> Let's fix this.
> > > >>>>>>>>
> > > >>>>>>>> Max
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> > >
> >
>


-- 

Chao-Han Tsai
