Jarek,

I tend to agree with you on this, but let me play devil’s advocate. If I
have a DAG that runs a report every Tuesday, I might want it to run every
Tuesday starting whenever I am able to release the DAG. But if I release on
a Friday, I don’t want it to try to run “for” last Tuesday. In this case,
the correct start_date for the DAG is the day I release the DAG, but I
don’t know this date ahead of time and it differs per environment. Doing
this properly seems doable with a CD process that edits the DAG to insert
the start_date, but that’s fairly sophisticated tooling for a scenario that
I imagine is quite common.
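
To make the Friday example concrete, here is a minimal sketch in plain
Python date arithmetic. It is not Airflow's scheduler logic; the helper
names previous_tuesday and next_tuesday and the release date are made up
for illustration.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() counts Monday as 0


def previous_tuesday(day: date) -> date:
    """Most recent Tuesday strictly before `day` (the run I want to skip)."""
    delta = (day.weekday() - TUESDAY) % 7 or 7
    return day - timedelta(days=delta)


def next_tuesday(day: date) -> date:
    """First Tuesday on or after `day` (the first run I actually want)."""
    return day + timedelta(days=(TUESDAY - day.weekday()) % 7)


# Hypothetical release date: a Friday.
release = date(2022, 3, 25)
print(previous_tuesday(release))  # 2022-03-22
print(next_tuesday(release))      # 2022-03-29
```

With a release on Friday 2022-03-25, the run I want to skip is the one
"for" 2022-03-22, and the first one I want is for 2022-03-29. How Airflow's
data intervals decide when that run actually fires is not modeled here.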

Collin McNulty

On Sun, Mar 20, 2022 at 1:55 PM Jarek Potiuk <[email protected]> wrote:

> Once again - why is it bad to set a start_date in the future, when -
> well - you **actually** want to run the first interval in the future ?
> What prevents you from setting the start-date to be a fixed time in
> the future, where the start date is within the interval you want to
> start first? Is it just "I do not want to specify conveniently
> whatever past date will be easy to type?"
> If this is the only reason,  then it has a big drawback - because
> "start_date" is **actually** supposed to be the piece of metadata for
> the DAG that will tell you what was the intention of the DAG writer on
> when to start it. And precisely one that allows you to start things in
> the future.
>
> Am I missing something?
>
> On Sun, Mar 20, 2022 at 7:42 PM Larry Komenda
> <[email protected]> wrote:
> >
> > Alex, that's a good point regarding the need to run a DAG for the most
> recent schedule interval right away. I hadn't thought of that scenario as I
> haven't needed to build a DAG with that large of a scheduling gap. In that
> case I agree with you - it seems like it would make more sense to make this
> configurable.
> >
> > Perhaps there could be an additional DAG-level parameter that could be
> set alongside "catchup" to control this behavior. Or there could be a new
> parameter that could eventually replace "catchup" that supported 3 options
> - "catchup", "run most recent interval only", and "run next interval only".
> >
> > On Sat, Mar 19, 2022 at 1:02 PM Alex Begg <[email protected]> wrote:
> >>
> >> I would not consider it a bug to have the latest data interval run when
> you enable a DAG that is set to catchup=False.
> >>
> >> I have legitimate use for that feature by having my production
> environment have catchup_by_default=True but my lower environments are
> using catchup_by_default=False, meaning if I want to test the DAG behavior
> as scheduled in a lower environment I can just enable the DAG.
> >>
> >> For example, in a staging environment, if I need to test the
> functionality of a DAG scheduled @monthly and there were no way to test
> the most recent data interval, then it could be many days, even weeks,
> before a true data interval of the DAG occurs.
> >>
> >> Triggering a DAG won’t run the latest data interval; it will use the
> current time as the logical_date, right? So that won’t let me test a
> single as-scheduled data interval. So in that @monthly scenario it will be
> impossible for me to test the functionality of a single data interval
> unless I wait multiple weeks.
> >>
> >> I see there could be a desire to not run the latest data interval and
> just start with whatever full interval follows the DAG being turned on.
> However I think that should be configurable, not fixed permanently.
> >>
> >> Alternatively it could be ideal to have a way to trigger a specific run
> for a catchup=False DAG that just got enabled by adding a third option to the
> trigger button drop down to trigger a past scheduled run. Then in that
> dialog the form can default to the most recent full data interval but then
> let you also specify a specific past interval based on the DAG's schedule.
> I often had to debug a DAG in production and I wanted to trigger a specific
> past data interval, not just the most recent.
> >>
> >> Alex Begg
> >>
> >> On Thu, Mar 17, 2022 at 4:58 PM Larry Komenda <
> [email protected]> wrote:
> >>>
> >>> I agree with this. I'd much rather have to trigger a single manual run
> the first time I enable a DAG than either wait to enable it until after I
> want it to run or edit the start_date of the DAG itself.
> >>>
> >>> I'd be in favor of adjusting this behavior either permanently or by a
> configuration.
> >>>
> >>> On Fri, Mar 4, 2022 at 3:00 PM Philippe Lanoe
> <[email protected]> wrote:
> >>>>
> >>>> Hello Daniel,
> >>>>
> >>>> Thank you for your answer. In your example, as I experienced, the
> first run would not be 2010-01-01 but 2022-03-03, 00:00:00 (it is currently
> March 4 - 21:00 here), which is the execution date corresponding to the
> start of the previous data interval, but the result is the same: an
> undesired dag run. (For instance, in case of cron schedule '00 22 * * *',
> one dagrun would be started immediately with execution date of 2022-03-02,
> 22:00:00)
> >>>>
> >>>> I also agree with you that it could be categorized as a bug and I
> would also vote for a fix.
> >>>>
> >>>> Would be great to have the feedback of others on this.
> >>>>
> >>>> On Fri, Mar 4, 2022 at 6:17 PM Daniel Standish
> <[email protected]> wrote:
> >>>>>
> >>>>> You are saying, when you turn on for the first time a dag with e.g.
> @daily schedule, and catchup = False, if start date is 2010-01-01, then it
> would run first the 2010-01-01 run, then the current run (whatever
> yesterday is)?  That sounds familiar.
> >>>>>
> >>>>> Yeah I don't like that behavior.  I agree that, as you say, it's not
> the intuitive behavior.  Seems it could reasonably be categorized as a
> bug.  I'd prefer we just "fix" it rather than making it configurable.  But
> some might have concerns re backcompat.
> >>>>>
> >>>>> What do others think?
> >>>>>
> >>>>>
>
-- 

Collin McNulty
Lead Airflow Engineer

Email: [email protected] <[email protected]>
Time zone: US Central (CST UTC-6 / CDT UTC-5)


<https://www.astronomer.io/>
