Re: airflow start_date confusion:

harish singh Sun, 12 Jun 2016 21:59:59 -0700

:) I did read this before posting.
The question I have is:

Say I have 3 DAGs. Let's say I set
'start_date': datetime(2015, 6, 1)


Now, in my pipeline.py, suppose I dynamically query some database table
and create DAGs from it.
Let's say I add a new DAG tomorrow.
That new DAG will get the same start_date = datetime(2015, 6, 1),
which means the pipeline for this new DAG will start from
datetime(2015, 6, 1) and not from datetime.now().

I am trying to understand the correct approach for setting this param
so that it stays flexible and extensible for future DAGs.
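
To make the question concrete, here is a minimal sketch of one possible
approach. It is an assumption for illustration, not an established Airflow
recommendation: each row of the table carries its own fixed start_date,
recorded once when the row is added, so a DAG added later does not inherit
datetime(2015, 6, 1). The names DAG_ROWS and build_default_args are
hypothetical, and the rows stand in for the real database query.

```python
from datetime import datetime

# Hypothetical rows standing in for the database table that pipeline.py queries.
# The "start_date" column is an assumption: it is set once, when the row is added,
# and never recomputed at parse time.
DAG_ROWS = [
    {"dag_id": "dag_a", "start_date": datetime(2015, 6, 1)},
    {"dag_id": "dag_b", "start_date": datetime(2015, 6, 1)},
    {"dag_id": "dag_new", "start_date": datetime(2016, 6, 13)},
]


def build_default_args(row):
    """Return per-DAG default_args with a static start_date taken from the row."""
    return {
        "owner": "airflow",
        "depends_on_past": False,
        "start_date": row["start_date"],
    }


# One default_args dict per DAG, keyed by dag_id.
default_args_by_dag = {row["dag_id"]: build_default_args(row) for row in DAG_ROWS}
```

Each dict would then be passed as default_args when the corresponding DAG
object is created, so the start_date stays static per DAG while the set of
DAGs remains dynamic.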


On Sun, Jun 12, 2016 at 4:42 PM, Maxime Beauchemin <
[email protected]> wrote:

> From: http://pythonhosted.org/airflow/faq.html
>
> *What’s the deal with ``start_date``?*
>
> start_date is partly legacy from the pre-DagRun era, but it is still
> relevant in many ways. When creating a new DAG, you probably want to set a
> global start_date for your tasks using default_args. The first DagRun to be
> created will be based on the min(start_date) for all your tasks. From that
> point on, the scheduler creates new DagRuns based on your
> schedule_interval and
> the corresponding task instances run as your dependencies are met. When
> introducing new tasks to your DAG, you need to pay special attention to
> start_date, and may want to reactivate inactive DagRuns to get the new task
> to get onboarded properly.
>
> We recommend against using dynamic values as start_date, especially
> datetime.now() as it can be quite confusing. The task is triggered once the
> period closes, and in theory an @hourly DAG would never get to an hour
> after now as now() moves along.
>
> We also recommend using rounded start_date in relation to your
> schedule_interval. This means an @hourly would be at 00:00 minutes:seconds,
> a @daily job at midnight, a @monthly job on the first of the month. You can
> use any sensor or a TimeDeltaSensor to delay the execution of tasks within
> that period. While schedule_interval does allow specifying a
> datetime.timedelta object, we recommend using the macros or cron
> expressions instead, as it enforces this idea of rounded schedules.
>
> When using depends_on_past=True it’s important to pay special attention to
> start_date, as the past dependency is waived only for the specific
> schedule that matches the start_date specified for the task. It’s also
> important to watch DagRun activity status over time when introducing new
> depends_on_past=True tasks, unless you are planning on running a backfill
> for the new task(s).
>
> Also important to note is that the task’s start_date, in the context of a
> backfill CLI command, gets overridden by the backfill command’s start_date.
> This allows a backfill on tasks that have depends_on_past=True to
> actually start; if that weren’t the case, the backfill just wouldn’t start.
>
> On Sun, Jun 12, 2016 at 3:17 PM, harish singh <[email protected]>
> wrote:
>
> > These are the default args to my DAG.
> > I am trying to run a standard hourly job (basically, at the end of
> > this hour, process last hours data)
> > I noticed that my pipeline is 1 hour late.
> >
> > For some reason, I am messing up with my start_date I guess.
> > What is the best practice for setting up start_date?
> >
> >
> > from datetime import datetime, timedelta
> >
> > # Dynamic start_date: the current hour (UTC) plus 15 minutes.
> > scheduling_start_date = (
> >     datetime.utcnow().replace(minute=0, second=0, microsecond=0)
> >     + timedelta(minutes=15)
> > )
> > default_schedule_interval = timedelta(minutes=60)
> >
> > default_args = {
> >     'owner': 'airflow',
> >     'depends_on_past': False,
> >     'start_date': scheduling_start_date,
> >     'email': ['[email protected]'],
> >     'email_on_failure': False,
> >     'email_on_retry': False,
> >     'retries': 2,
> >     'retry_delay': default_retries_delay,  # not defined in this snippet
> >     'schedule_interval': default_schedule_interval,
> >     # 'queue': 'bash_queue',
> >     # 'pool': 'backfill',
> >     # 'priority_weight': 10,
> >     # 'end_date': datetime(2016, 1, 1),
> > }
> >
>
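
The FAQ's point that a task is triggered once its period closes can be
illustrated in plain Python. This is a simplified model, not Airflow's
scheduler code; the function names first_trigger_time and is_first_run_due
are hypothetical, and the timestamps are chosen only for the example.

```python
from datetime import datetime, timedelta


def first_trigger_time(start_date, interval):
    """Earliest wall-clock time at which the first period has closed."""
    return start_date + interval


def is_first_run_due(start_date, interval, now):
    """True once the period beginning at start_date has fully elapsed."""
    return now >= first_trigger_time(start_date, interval)


interval = timedelta(hours=1)
now = datetime(2016, 6, 12, 15, 17)

# Static, rounded start_date: the 14:00-15:00 period has closed by 15:17.
static_due = is_first_run_due(datetime(2016, 6, 12, 14, 0), interval, now)

# Dynamic start_date that is re-evaluated as "now" every time it is checked:
# start_date + interval always lies in the future, so the run is never due.
dynamic_due = is_first_run_due(now, interval, now)
```

Under this model, an hourly run covering 14:00 to 15:00 becomes due at 15:00,
which is why an hourly pipeline appears to be "1 hour late" relative to the
period it processes.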
