Thanks to both Max and Dan for your comments; please see my replies below:

>  Personally I vote for a DAG version to be pinned and consistent for the
> duration of the DAG run. Some of the reasons why:
> - it's easier to reason about, and therefore visualize and troubleshoot
> - it prevents some cases where dependencies are never met
> - it prevents the explosion of artifact/metadata (one serialization per
> dagrun as opposed to one per scheduler cycle) in the case of a dynamic DAG
> whose fingerprint is never the same.


In this AIP, we were only looking to fix the current "Viewing behaviour" and
we intentionally did not change the execution behaviour.
The change you are suggesting means we would need to introduce DAG Versioning
for the workers too. That needs more work, as we can't use the Serialised
Representation to run a task: users can rely on custom modules in other parts
of their code. For example, the PythonOperator has a python_callable argument
that allows running arbitrary code.
A similar case is the *on_*_callbacks* defined on the DAG.
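
To make the limitation concrete, here is a minimal sketch (the helper module
my_company.etl and its callable are hypothetical, and the import paths are the
Airflow 1.10-era ones) of a DAG whose task logic lives outside the DAG file,
so a JSON-serialised representation alone could not execute it:

    # dags/process_orders.py
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # Hypothetical helper module that lives alongside the DAG files.
    # The serialized (JSON) DAG only stores a reference to this callable,
    # not its source, so the worker still needs the module on disk.
    from my_company.etl import transform_orders

    with DAG(
        dag_id="process_orders",
        start_date=datetime(2020, 7, 1),
        schedule_interval="@hourly",
    ) as dag:
        PythonOperator(
            task_id="transform_orders",
            python_callable=transform_orders,  # arbitrary user code
        )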

Based on the current scope of the AIP, we still plan to use the actual DAG
files for execution and not use Serialized DAGs on the workers.

To account for all the custom modules, we would have to start looking at
pickle (cloudpickle).
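
As a rough illustration of what that buys us (just a sketch, not part of the
AIP), cloudpickle can serialize a callable defined in the running script by
value, together with module-level data it references, which a plain JSON
representation of the DAG cannot do:

    import cloudpickle

    THRESHOLD = 100  # module-level value the callable depends on

    def flag_large_orders(orders):
        # References THRESHOLD; a JSON-serialized DAG would not carry this.
        return [o for o in orders if o["amount"] > THRESHOLD]

    # Executor side: serialize the callable and the globals it references.
    payload = cloudpickle.dumps(flag_large_orders)

    # Worker side: rebuild and run the callable from the payload alone.
    restored = cloudpickle.loads(payload)
    print(restored([{"amount": 50}, {"amount": 500}]))  # [{'amount': 500}]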

> I'm certain that there are lots of
> those DAGs out there, and that it will overwhelm the metadata database, and
> confuse the users. For an hourly DAG is would mean 24 artifact per day
> instead of 1000+


What kind of dynamic DAGs are we talking about here? I would think the DAG
signature won't change, but I might be wrong; can you give an example, please?
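
One pattern that would do it (my guess at what you might mean, not something
from the AIP) is a DAG that derives attributes from the parse time itself,
which would indeed hash differently on every scheduler cycle:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # start_date is evaluated at every parse, so the serialized form
    # (and therefore its fingerprint) changes on each scheduler cycle.
    with DAG(
        dag_id="dynamic_fingerprint_example",
        start_date=datetime.now() - timedelta(days=1),
        schedule_interval="@hourly",
    ) as dag:
        DummyOperator(task_id="noop")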

> If backwards compatibility in behavior is a concern, I'd recommend adding a
> flag to the DAG class and/or config and make sure we're doing the right
> thing by default. People who want backward compatibility would have to
> change that default. But again, that's a lot of extra and confusing
> complexity that will likely be the source of bugs and user confusion.
> Having a clear, easy to reason about execution model is super important.

> Think about visualizing a DAG that shapeshifted 5 times during its
> execution, how does anyone make sense of that?


Wouldn't that be an edge case? How often would someone change the DAG
structure in the middle of a DAG run? And if they do change it, the Graph View
should show all the tasks that were actually run; if it only showed the latest
version, the behaviour would be the same as it is now.

--------

> Strongly agree with Max's points, also I feel the right way to go about
> this is instead of Airflow schedulers/webservers/workers reading DAG Python
> files, they would instead read from serialized representations of the DAGs
> (e.g. json representation in the Airflow DB). Instead of DAG owners pushing
> their DAG files to the Airflow components via varying mechanisms (e.g.
> git), they would instead call an Airflow CLI to push the serialized DAG
> representations to the DB, and for things like dynamic DAGs you could
> populate them from a DAG or another service.


The Airflow Webserver and the Scheduler will definitely read from the
Serialized representation, as they don't need all the code from the DAG files.

The workers, however, definitely need access to the DAG files, as the
tasks/operators use code from custom modules and classes that are required to
run the tasks.

If we do want to go down that route, we would have to use something like
cloudpickle, which serializes the entire DAG and its dependencies. We would
also need to ensure that no one can tamper with the pickled source while it is
sent from the executor to the worker, as that poses a big security risk.
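
As a sketch of one possible mitigation (purely illustrative; the key handling
here is an assumption and not part of the proposal), the sender could sign the
pickled payload with a shared secret and the worker could verify the signature
before unpickling anything:

    import hashlib
    import hmac

    import cloudpickle

    SHARED_SECRET = b"replace-with-a-real-secret"  # hypothetical shared key

    def sign(payload: bytes) -> bytes:
        return hmac.new(SHARED_SECRET, payload, hashlib.sha256).digest()

    # Executor side: pickle the callable and prepend an HMAC signature.
    payload = cloudpickle.dumps(lambda: "run the task")
    message = sign(payload) + payload

    # Worker side: verify the signature before unpickling anything.
    signature, payload = message[:32], message[32:]
    if not hmac.compare_digest(signature, sign(payload)):
        raise ValueError("pickled payload was modified in transit")
    task_callable = cloudpickle.loads(payload)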

- Kaxil

On Wed, Jul 29, 2020 at 12:43 PM Jacob Ward <jw...@brandwatch.com> wrote:

> I came here to say what Max has said, only less eloquently.
>
> I do have one concern with locking the version for a single run. Currently
> it is possible for a user to create a dag which intentionally changes as a
> dag executes, i.e. dynamically creating a task for the dag during a run by
> modifying external data, but this change would prevent that. I'm of the
> opinion that this situation is bad practice anyway so it doesn't matter if
> we make it impossible to do, but others may disagree.
>
> On Tue, 28 Jul 2020 at 17:08, Dan Davydov <ddavy...@twitter.com.invalid>
> wrote:
>
> > Strongly agree with Max's points, also I feel the right way to go about
> > this is instead of Airflow schedulers/webservers/workers reading DAG
> Python
> > files, they would instead read from serialized representations of the
> DAGs
> > (e.g. json representation in the Airflow DB). Instead of DAG owners
> pushing
> > their DAG files to the Airflow components via varying mechanisms (e.g.
> > git), they would instead call an Airflow CLI to push the serialized DAG
> > representations to the DB, and for things like dynamic DAGs you could
> > populate them from a DAG or another service.
> >
> > This would also enable other features like stronger
> security/multi-tenancy.
> >
> > On Tue, Jul 28, 2020 at 6:44 PM Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> > > > "mixed version"
> > >
> > > Personally I vote for a DAG version to be pinned and consistent for the
> > > duration of the DAG run. Some of the reasons why:
> > > - it's easier to reason about, and therefore visualize and troubleshoot
> > > - it prevents some cases where dependencies are never met
> > > - it prevents the explosion of artifact/metadata (one serialization per
> > > dagrun as opposed to one per scheduler cycle) in the case of a dynamic
> > DAG
> > > whose fingerprint is never the same. I'm certain that there are lots of
> > > those DAGs out there, and that it will overwhelm the metadata database,
> > and
> > > confuse the users. For an hourly DAG is would mean 24 artifact per day
> > > instead of 1000+
> > >
> > > If backwards compatibility in behavior is a concern, I'd recommend
> > adding a
> > > flag to the DAG class and/or config and make sure we're doing the right
> > > thing by default. People who want backward compatibility would have to
> > > change that default. But again, that's a lot of extra and confusing
> > > complexity that will likely be the source of bugs and user confusion.
> > > Having a clear, easy to reason about execution model is super
> important.
> > >
> > > Think about visualizing a DAG that shapeshifted 5 times during its
> > > execution, how does anyone make sense of that?
> > >
> > > Max
> > >
> > > On Tue, Jul 28, 2020 at 3:14 AM Kaxil Naik <kaxiln...@gmail.com>
> wrote:
> > >
> > > > Thanks Max for your comments.
> > > >
> > > >
> > > > *DAG Fingerprinting: *this can be tricky, especially in regards to
> > > dynamic
> > > > > DAGs, where in some cases each parsing of the DAG can result in a
> > > > different
> > > > > fingerprint. I think DAG and tasks attributes are left out from the
> > > > > proposal that should be considered as part of the fingerprint, like
> > > > trigger
> > > > > rules or task start/end datetime. We should do a full pass of all
> DAG
> > > > > arguments and make sure we're not forgetting anything that can
> change
> > > > > scheduling logic. Also, let's be careful that something as simple
> as
> > a
> > > > > dynamic start or end date on a task could lead to a different
> version
> > > > each
> > > > > time you parse.
> > > >
> > > >
> > > >
> > > > The short version of Dag Fingerprinting would be
> > > > just a hash of the Serialized DAG.
> > > >
> > > > *Example DAG*: https://imgur.com/TVuoN3p
> > > > *Example Serialized DAG*: https://imgur.com/LmA2Bpr
> > > >
> > > > It contains all the task & DAG parameters. When they change,
> Scheduler
> > > > writes
> > > > a new version of Serialized DAGs to the DB. The Webserver then reads
> > the
> > > > DAGs from the DB.
> > > >
> > > > I'd recommend limiting serialization/storage of one version
> > > > > per DAG Run, as opposed to potentially everytime the DAG is parsed
> -
> > > once
> > > > > the version for a DAG run is pinned, fingerprinting is not
> > re-evaluated
> > > > > until the next DAG run is ready to get created.
> > > >
> > > >
> > > > This is to handle Scenario 3 where a DAG structure is changed
> mid-way.
> > > > Since we don't intend to
> > > > change the execution behaviour, if we limit Storage of 1 version per
> > DAG,
> > > > it won't actually show what
> > > > was run.
> > > >
> > > > Example Dag v1: Task A -> Task B -> Task C
> > > > The worker has completed the execution of Task B and is just about to
> > > > complete the execution of Task B.
> > > >
> > > > The 2nd version of DAG is deployed: Task A -> Task D
> > > > Now Scheduler queued Task D and it will run to completion. (Task C
> > won't
> > > > run)
> > > >
> > > > In this case, "the actual representation of the DAG" that run is
> > neither
> > > v1
> > > > nor v2 but a "mixed version"
> > > >  (Task A -> Task B -> Task D). The plan is that the Scheduler will
> > create
> > > > this "mixed version" based on what ran
> > > > and the Graph View would show this "mixed version".
> > > >
> > > > There would also be a toggle button on the Graph View to select v1 or
> > v2
> > > > where the tasks will be highlighted to show
> > > > that a particular task was in v1 or v2 as shown in
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/158868919/Picture%201.png?version=2&modificationDate=1595612863000&api=v2
> > > >
> > > >
> > > >
> > > > *Visualizing change in the tree view:* I think this is very complex
> and
> > > > > many things can make this view impossible to render (task
> dependency
> > > > > reversal, cycles across versions, ...). Maybe a better visual
> > approach
> > > > > would be to render independent, individual tree views for each DAG
> > > > version
> > > > > (side by side), and doing best effort aligning the tasks across
> > blocks
> > > > and
> > > > > "linking" tasks with lines across blocks when necessary.
> > > >
> > > >
> > > > Agreed, the plan is to do the best effort aligning.
> > > > At this point in time, task additions to the end of the DAG are
> > expected
> > > to
> > > > be compatible,
> > > > but changes to task structure within the DAG may cause the tree view
> > not
> > > to
> > > > incorporate “old” and “new” in the same view, hence that won't be
> > shown.
> > > >
> > > > Regards,
> > > > Kaxil
> > > >
> > > > On Mon, Jul 27, 2020 at 6:02 PM Maxime Beauchemin <
> > > > maximebeauche...@gmail.com> wrote:
> > > >
> > > > > Some notes and ideas:
> > > > >
> > > > > *DAG Fingerprinting: *this can be tricky, especially in regards to
> > > > dynamic
> > > > > DAGs, where in some cases each parsing of the DAG can result in a
> > > > different
> > > > > fingerprint. I think DAG and tasks attributes are left out from the
> > > > > proposal that should be considered as part of the fingerprint, like
> > > > trigger
> > > > > rules or task start/end datetime. We should do a full pass of all
> DAG
> > > > > arguments and make sure we're not forgetting anything that can
> change
> > > > > scheduling logic. Also, let's be careful that something as simple
> as
> > a
> > > > > dynamic start or end date on a task could lead to a different
> version
> > > > each
> > > > > time you parse. I'd recommend limiting serialization/storage of one
> > > > version
> > > > > per DAG Run, as opposed to potentially everytime the DAG is parsed
> -
> > > once
> > > > > the version for a DAG run is pinned, fingerprinting is not
> > re-evaluated
> > > > > until the next DAG run is ready to get created.
> > > > >
> > > > > *Visualizing change in the tree view:* I think this is very complex
> > and
> > > > > many things can make this view impossible to render (task
> dependency
> > > > > reversal, cycles across versions, ...). Maybe a better visual
> > approach
> > > > > would be to render independent, individual tree views for each DAG
> > > > version
> > > > > (side by side), and doing best effort aligning the tasks across
> > blocks
> > > > and
> > > > > "linking" tasks with lines across blocks when necessary.
> > > > >
> > > > > On Fri, Jul 24, 2020 at 12:46 PM Vikram Koka <vik...@astronomer.io
> >
> > > > wrote:
> > > > >
> > > > > > Team,
> > > > > >
> > > > > >
> > > > > >
> > > > > > We just created 'AIP-36 DAG Versioning' on Confluence and would
> > very
> > > > much
> > > > > > appreciate feedback and suggestions from the community.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning
> > > > > >
> > > > > >
> > > > > >
> > > > > > The DAG Versioning concept has been discussed on multiple
> occasions
> > > in
> > > > > the
> > > > > > past and has been a topic highlighted as part of Airflow 2.0 as
> > well.
> > > > We
> > > > > at
> > > > > > Astronomer have heard data engineers at several enterprises ask
> > about
> > > > > this
> > > > > > feature as well, for easier debugging when changes are made to
> DAGs
> > > as
> > > > a
> > > > > > result of evolving business needs.
> > > > > >
> > > > > >
> > > > > > As described in the AIP, we have a proposal focused on ensuring
> > that
> > > > the
> > > > > > visibility behaviour of Airflow is correct, without changing the
> > > > > execution
> > > > > > behaviour. We considered changing the execution behaviour as
> well,
> > > but
> > > > > > decided that the risks in changing execution behavior were too
> high
> > > as
> > > > > > compared to the benefits and therefore decided to limit the scope
> > to
> > > > only
> > > > > > making sure that the visibility was correct.
> > > > > >
> > > > > >
> > > > > > We would like to attempt this based on our experience running
> > Airflow
> > > > as
> > > > > a
> > > > > > service. We believe that this benefits Airflow as a project and
> the
> > > > > > development experience of data engineers using Airflow across the
> > > > world.
> > > > > >
> > > > > >
> > > > > >  Any feedback, suggestions, and comments would be greatly
> > > appreciated.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Best Regards,
> > > > > >
> > > > > >
> > > > > > Kaxil Naik, Ryan Hamilton, Ash Berlin-Taylor, and Vikram Koka
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
>
> Jacob Ward    |    Graduate Data Infrastructure Engineer
>
> jw...@brandwatch.com
>
>
>
