> "mixed version"

Personally I vote for a DAG version to be pinned and consistent for the
duration of the DAG run. Some of the reasons why:
- it's easier to reason about, and therefore visualize and troubleshoot
- it prevents some cases where dependencies are never met
- it prevents the explosion of artifacts/metadata (one serialization per
DAG run as opposed to one per scheduler cycle) in the case of a dynamic DAG
whose fingerprint is never the same. I'm certain that there are lots of
those DAGs out there, and that serializing them on every cycle would
overwhelm the metadata database and confuse users. For an hourly DAG it
would mean 24 artifacts per day instead of 1,000+.
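To make the pinning idea concrete, here's a minimal sketch (all names are hypothetical, not Airflow's actual API) of hashing a serialized DAG once when a run is created, so later parses don't produce new stored versions:

```python
import hashlib
import json

# Hypothetical sketch of per-DAG-run version pinning; none of these names
# are Airflow's actual API.

def fingerprint(serialized_dag: dict) -> str:
    """Hash a canonical JSON form of the serialized DAG."""
    canonical = json.dumps(serialized_dag, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class DagRun:
    def __init__(self, serialized_dag: dict):
        # The version is evaluated once, when the run is created...
        self.pinned_version = fingerprint(serialized_dag)

    def observe_parse(self, serialized_dag: dict) -> str:
        # ...and re-parses during the run do NOT re-pin it, so a dynamic
        # DAG that serializes differently on every parse still yields a
        # single stored version per run (24/day for an hourly DAG).
        return self.pinned_version

run = DagRun({"dag_id": "hourly", "tasks": ["A", "B", "C"]})
# A later parse yields a different serialization, but the run keeps v1:
pinned = run.observe_parse({"dag_id": "hourly", "tasks": ["A", "B", "X"]})
```

The point of the sketch is just where the fingerprint evaluation happens: at run creation, not on every scheduler cycle.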

If backwards compatibility in behavior is a concern, I'd recommend adding a
flag to the DAG class and/or config and making sure we do the right thing
by default. People who want the old behavior would have to opt out of that
default. But again, that's a lot of extra, confusing complexity that will
likely be a source of bugs and user confusion.
Having a clear, easy to reason about execution model is super important.

Think about visualizing a DAG that shapeshifted 5 times during its
execution: how does anyone make sense of that?

Max

On Tue, Jul 28, 2020 at 3:14 AM Kaxil Naik <kaxiln...@gmail.com> wrote:

> Thanks Max for your comments.
>
>
> > *DAG Fingerprinting: *this can be tricky, especially with regard to
> > dynamic DAGs, where in some cases each parsing of the DAG can result in a
> > different fingerprint. I think some DAG and task attributes that should
> > be considered as part of the fingerprint are left out of the proposal,
> > like trigger rules or task start/end datetimes. We should do a full pass
> > of all DAG arguments and make sure we're not forgetting anything that can
> > change scheduling logic. Also, let's be careful: something as simple as a
> > dynamic start or end date on a task could lead to a different version
> > each time you parse.
>
>
>
> The short version of DAG Fingerprinting would be just a hash of the
> Serialized DAG.
>
> *Example DAG*: https://imgur.com/TVuoN3p
> *Example Serialized DAG*: https://imgur.com/LmA2Bpr
>
> It contains all the task & DAG parameters. When they change, the Scheduler
> writes a new version of the Serialized DAG to the DB. The Webserver then
> reads the DAGs from the DB.
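As a rough sketch of that write path (hypothetical names, not the actual Airflow schema), a new row is only persisted when the serialization's hash actually changes:

```python
import hashlib
import json

# Hypothetical sketch: only write a new serialized-DAG row when the hash
# changes; the webserver reads whatever is in the table.

db = {}  # stand-in for the serialized_dag table: dag_id -> (hash, doc)

def write_if_changed(dag_id: str, serialized_dag: dict) -> bool:
    """Persist a new version only when the serialization actually changed."""
    digest = hashlib.sha256(
        json.dumps(serialized_dag, sort_keys=True).encode("utf-8")
    ).hexdigest()
    if dag_id in db and db[dag_id][0] == digest:
        return False  # unchanged: no new version written
    db[dag_id] = (digest, serialized_dag)
    return True

wrote_first = write_if_changed("etl", {"tasks": ["A", "B"]})  # True
wrote_again = write_if_changed("etl", {"tasks": ["A", "B"]})  # False
```

This is what keeps a stable DAG at one stored version, while a structural change produces exactly one new one.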
>
> > I'd recommend limiting serialization/storage to one version per DAG run,
> > as opposed to potentially every time the DAG is parsed - once the version
> > for a DAG run is pinned, fingerprinting is not re-evaluated until the
> > next DAG run is ready to be created.
>
>
> This is to handle Scenario 3, where a DAG's structure is changed mid-way
> through a run. Since we don't intend to change the execution behaviour, if
> we limit storage to one version per DAG run, it won't actually show what
> was run.
>
> Example DAG v1: Task A -> Task B -> Task C
> The worker has completed the execution of Task B and is just about to
> start the execution of Task C.
>
> The 2nd version of the DAG is deployed: Task A -> Task D
> Now the Scheduler queues Task D and it will run to completion. (Task C
> won't run)
>
> In this case, "the actual representation of the DAG" that ran is neither
> v1 nor v2 but a "mixed version" (Task A -> Task B -> Task D). The plan is
> that the Scheduler will create this "mixed version" based on what ran,
> and the Graph View would show this "mixed version".
>
> There would also be a toggle button on the Graph View to select v1 or v2,
> where tasks will be highlighted to show whether a particular task was in
> v1 or v2, as shown in
>
> https://cwiki.apache.org/confluence/download/attachments/158868919/Picture%201.png?version=2&modificationDate=1595612863000&api=v2
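For illustration, the per-version highlighting described above could be derived roughly like this (a sketch with made-up names, not the proposed implementation):

```python
# Hypothetical sketch: given which tasks each deployed DAG version contains
# and which tasks actually ran, work out which version(s) each executed
# task should be highlighted under in the Graph View.

versions = {"v1": {"A", "B", "C"}, "v2": {"A", "D"}}  # deployed structures
ran = ["A", "B", "D"]  # what the workers actually executed (mixed version)

def version_membership(versions: dict, ran: list) -> dict:
    """Map each task that ran to the deployed versions that contain it."""
    return {
        task: sorted(v for v, tasks in versions.items() if task in tasks)
        for task in ran
    }

membership = version_membership(versions, ran)
# Task A appears in both versions, B only in v1, D only in v2
```

In this example, toggling to v1 would highlight A and B, and toggling to v2 would highlight A and D, matching the "mixed version" of tasks that actually ran.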
>
>
>
> > *Visualizing change in the tree view:* I think this is very complex,
> > and many things can make this view impossible to render (task dependency
> > reversal, cycles across versions, ...). Maybe a better visual approach
> > would be to render independent, individual tree views for each DAG
> > version (side by side), doing a best effort to align the tasks across
> > blocks and "linking" tasks with lines across blocks when necessary.
>
>
> Agreed, the plan is to do best-effort aligning.
> At this point in time, task additions at the end of the DAG are expected
> to be compatible, but changes to the task structure within the DAG may
> prevent the tree view from incorporating "old" and "new" in the same
> view, so in that case a combined view won't be shown.
>
> Regards,
> Kaxil
>
> On Mon, Jul 27, 2020 at 6:02 PM Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
>
> > Some notes and ideas:
> >
> > *DAG Fingerprinting: *this can be tricky, especially with regard to
> > dynamic DAGs, where in some cases each parsing of the DAG can result in
> > a different fingerprint. I think some DAG and task attributes that
> > should be considered as part of the fingerprint are left out of the
> > proposal, like trigger rules or task start/end datetimes. We should do a
> > full pass of all DAG arguments and make sure we're not forgetting
> > anything that can change scheduling logic. Also, let's be careful:
> > something as simple as a dynamic start or end date on a task could lead
> > to a different version each time you parse. I'd recommend limiting
> > serialization/storage to one version per DAG run, as opposed to
> > potentially every time the DAG is parsed - once the version for a DAG
> > run is pinned, fingerprinting is not re-evaluated until the next DAG run
> > is ready to be created.
> >
> > *Visualizing change in the tree view:* I think this is very complex,
> > and many things can make this view impossible to render (task dependency
> > reversal, cycles across versions, ...). Maybe a better visual approach
> > would be to render independent, individual tree views for each DAG
> > version (side by side), doing a best effort to align the tasks across
> > blocks and "linking" tasks with lines across blocks when necessary.
> >
> > On Fri, Jul 24, 2020 at 12:46 PM Vikram Koka <vik...@astronomer.io> wrote:
> >
> > > Team,
> > >
> > >
> > >
> > > We just created 'AIP-36 DAG Versioning' on Confluence and would very
> > > much appreciate feedback and suggestions from the community.
> > >
> > >
> > >
> > >
> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning
> > >
> > >
> > >
> > > The DAG Versioning concept has been discussed on multiple occasions
> > > in the past and has been a topic highlighted as part of Airflow 2.0 as
> > > well. We at Astronomer have heard data engineers at several
> > > enterprises ask about this feature as well, for easier debugging when
> > > changes are made to DAGs as a result of evolving business needs.
> > >
> > >
> > > As described in the AIP, we have a proposal focused on ensuring that
> > > the visibility behaviour of Airflow is correct, without changing the
> > > execution behaviour. We considered changing the execution behaviour as
> > > well, but decided that the risks of doing so were too high compared to
> > > the benefits, and therefore limited the scope to making sure that the
> > > visibility is correct.
> > >
> > >
> > > We would like to attempt this based on our experience running Airflow
> > > as a service. We believe that this benefits Airflow as a project and
> > > the development experience of data engineers using Airflow across the
> > > world.
> > >
> > >
> > > Any feedback, suggestions, and comments would be greatly appreciated.
> > >
> > >
> > >
> > > Best Regards,
> > >
> > >
> > > Kaxil Naik, Ryan Hamilton, Ash Berlin-Taylor, and Vikram Koka
> > >
> >
>
