Hey folks, Just reviving this old thread with an update: in the coming days we (Astronomer) will be resurrecting AIP-36 DAG Versioning with a different scope, one more consistent with what has been discussed in this thread.
Regards, Kaxil On Thu, Aug 13, 2020 at 9:32 PM Jarek Potiuk <[email protected]> wrote: > I fully agree with the "user" not having to know any of the "wheel" > details. Similarly as they do not have to know the python interpreter or the > underlying libc library details. This all should be hidden from the users. > > I think the wheels API that we might have there, does not have to be > user-facing. We could - rather easily - make a client that points to a DAG > file and builds appropriate wheel package under-the-hood and submits it. I > really doubt any of the users will directly use the API to submit DAGs - > they will use some clients built on top of it. > > I think we should separate the user side from the implementation - > similarly as we do not expect the users to know any details on how "DAG > Fetcher" should work - in any case with the DAG fetcher, we need to define > how DAG fetcher will make sure about "atomicity" anyway - how to make sure > that you get a "consistent" version of all the dependent python files when > you fetch them? This is the part of DAG fetcher that I do not like because > it assumes that "someone else" maintains the consistency and provides the > "consistent view" somewhere on the "DAG Server" side (whatever the server > side is). > > There were many ideas about some kind of manifest describing the files etc, > but I think all of that depends on some kind of ability of providing a > "snapshot" of files that will be a consistent set to execute. With "DAG > Fetcher" this is something that the "DAG Fetching server" has to provide. It's > super easy if that "server" is GIT - we already use it for GIT sync. But > it's rather difficult to provide a good abstraction for it for a "generic" > DAG fetcher. > > IMHO it is far easier to provide such a consistent set at a "submission > time". In pretty-much all cases, the user submitting the job already has a > consistent set of python files that the DAG uses. This is pretty much > given. I think the job of the "submission" mechanism is to make a > "snapshot" out of that consistent set and submit this snapshot, rather than > individual files. Git provides it out of the box, but if we want to be > generic - I see no other way than to build such a "snapshot" locally. And > Wheels seem like a very good candidate - if only it's an implementation > detail and will be hidden from the users. > > J. > > > > > On Tue, Aug 11, 2020 at 8:33 PM Ash Berlin-Taylor <[email protected]> wrote: > > > Anything to do with the process of building wheels should be a "power > > user" only feature, and should not be required for many users - many many > > users of airflow are not primarily Python developers, but data > scientists, > > and needing them to understand anything about the python build toolchain > is > > too much of a learning curve for the benefit. > > > > After all it is very rare that people hit the multiple concurrent > versions > > of a dag. > > > > -ash > > > > On 10 August 2020 17:37:32 BST, Tomasz Urbaszek <[email protected]> > > wrote: > > >I like the idea of wheels as this is probably the "most pythonic" > > >solution. And "DAG version" is not only defined by DAG code but also > > >by all dependencies the DAG uses (custom functions, libraries etc) and > > >it seems that wheels can address that. > > > > > >However, I second Ash - keeping wheels in db doesn't sound good. In my > > >opinion, DAG fetcher is the right solution and the idea surfaces every > > >time we talk about serialization. 
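
For illustration, a minimal sketch of what such a pluggable DAG fetcher could look like. Every class and method name below is hypothetical rather than an existing Airflow API, and the git variant gets its "consistent snapshot" property from a commit SHA:

    # Hypothetical sketch of a pluggable "DAG fetcher"; none of these names
    # exist in Airflow today.
    import abc
    import pathlib
    import shutil
    import subprocess
    import tempfile

    class BaseDagFetcher(abc.ABC):
        """Fetch a consistent snapshot of a DAG and its support files by version."""

        @abc.abstractmethod
        def fetch(self, dag_id: str, version: str) -> pathlib.Path:
            """Return a local directory containing the requested DAG version."""

    class LocalFsDagFetcher(BaseDagFetcher):
        """Fallback for users without git: no real versioning, just a copy."""

        def __init__(self, dags_root: pathlib.Path):
            self.dags_root = dags_root

        def fetch(self, dag_id: str, version: str) -> pathlib.Path:
            target = pathlib.Path(tempfile.mkdtemp(prefix=f"{dag_id}-{version}-"))
            shutil.copytree(self.dags_root, target, dirs_exist_ok=True)
            return target

    class GitDagFetcher(BaseDagFetcher):
        """A commit SHA pins every dependent file at once, giving atomicity."""

        def __init__(self, repo_url: str):
            self.repo_url = repo_url

        def fetch(self, dag_id: str, version: str) -> pathlib.Path:
            target = pathlib.Path(tempfile.mkdtemp(prefix=f"{dag_id}-"))
            subprocess.run(["git", "clone", self.repo_url, str(target)], check=True)
            subprocess.run(["git", "-C", str(target), "checkout", version], check=True)
            return target

A worker (or the scheduler) would call fetcher.fetch(dag_id, version) and parse the DAG from the returned directory, regardless of the backing store.
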
This abstraction has a lot of pros > > >as it allows a lot of customization (wheels, local fs, remote fs, > > >wheels etc). > > > > > >Apart from that, if we decided to use wheels we should provide a CLI > > >command to ease the process of building them. Also, I'm wondering > > >about developers' workflow. Moving between code of different DAG > > >version sounds easy if you use git but... what if someone doesn't use > > >it? > > > > > >Tomek > > > > > > > > >On Sat, Aug 8, 2020 at 9:49 AM Ash Berlin-Taylor <[email protected]> > > >wrote: > > >> > > >> Quick comment (as I'm still mostly on paternity leave): > > >> > > >> Storing wheels in the db sounds like a bad Idea to me, especially if > > >we need to store deps in there too (and if we don't store deps, then > > >they are incomplete) - they could get very large, and I've stored blobs > > >of ~10mb in postgres before: I don't recommend it. It "works" but > > >operating it is tricky. > > >> > > >> > > >> > > >> > the API could simply accept "Wheel file + the Dag id" > > >> > > >> This sounds like a huge security risk. > > >> > > >> > > >> My main concern with this idea is that it seems a lot of complexity > > >we are putting on users. Doubly so if they are already using docker > > >where there already exists an Ideal packaging and distribution that > > >could contain dag + needed code. > > >> > > >> (Sorry for the brevity) > > >> > > >> -ash > > >> > > >> > > >> On 2 August 2020 08:47:39 BST, Jarek Potiuk > > ><[email protected]> wrote: > > >> >Few points from my sid (and proposal!): > > >> > > > >> >1) Agree with Max - with a rather strong NO for pickles (however, > > >> >indeed cloudpickle solves some of the problems). Pickles came up in > > >> >our discussion in Polidea recently and the overall message was "no". > > >I > > >> >agree with Max here - if we can ship python code, turning that into > > >> >pickle for transit makes little sense to me and brings a plethora of > > >> >problems. > > >> > > > >> >2) I think indeed the versioning solution should treat the "DagRun" > > >> >structure atomically. While I see why we would like to go with the > > >> >UI/Scheduler only first rather than implementing them in the > > >workers, > > >> >adding the "mixed version" is where it breaks down IMHO. Reasoning > > >> >about such "mixed version" dag is next to impossible. The current > > >> >behavior is not well defined and non-deterministic (depends on > > >> >scheduler delays, syncing, type of deployment, restarts of the works > > >> >etc.) we are moving it up to UI (thus users) rather than solving the > > >> >problem. So I am not a big fan of this and would rather solve it > > >> >"well" with atomicity. > > >> > > > >> >3) I see the point of Dan as well - we had many discussions and many > > >> >times the idea about "submitting" the DAG for execution via the API > > >> >came up - and it makes sense IMHO. > > >> > > > >> >Proposal: Implement full versioning with code shipping via DB wheels > > >> >BLOB (akin to serialized DAGs). > > >> > > > >> >I understand that the big issue is how to actually "ship" the code > > >to > > >> >the worker. And - maybe a wild idea - we can kill several birds with > > >> >the same stone. > > >> > > > >> >There were plenty of discussions on how we could do that but one was > > >> >never truly explored - using wheel packages. > > >> > > > >> >For those who do not know them, there is the PEP: > > >> >https://www.python.org/dev/peps/pep-0427/ > > >> > > > >> >Wheels allow to "package" python code in a standard way. 
They are > > >> >portable ("purelib" + contain .py rather than .pyc code), they have > > >> >metadata, versioning information, they can be signed for security, > > >> >They can contain other packages or python code, Why don't we let > > >> >scheduler to pack the fingerprinted version of the DAG in a .whl and > > >> >store it as a blob in a DB next to the serialized form? > > >> > > > >> >There were concerns about the size of the code to keep in the DB - > > >but > > >> >we already use the DB for serialized DAGs and it works fine (I > > >believe > > >> >we only need to add compressing of the JSon serialized form - as > > >we've > > >> >learned from AirBnb during their talk at the Airflow Summit - wheels > > >> >are already compressed). Also - each task will only need the > > >> >particular "version" of one DAG so even if we keep many of them in > > >the > > >> >DB, the old version will pretty soon go "cold" and will never be > > >> >retrieved (and most DBs will handle it well with caching/indexes). > > >> > > > >> >And if we want to add "callables" from other files - there is > > >nothing > > >> >to stop the person who defines dag to add list of files that should > > >be > > >> >packaged together with the main DAG file (additional_python_files = > > >> >["common/my_fantastic_library.py"] in DAG constructor). Or we could > > >> >auto-add all files after the DAG gets imported (i.e. package > > >> >automatically all files that are imported for that particular DAG > > >from > > >> >the "dags" folder"). That should be rather easy. > > >> > > > >> >This way we could ship the code to workers for the exact version > > >that > > >> >the DagRun uses. And they can be cached and unpacked/installed to a > > >> >virtualenv for the execution of that single task. That should be > > >super > > >> >quick. Such virtualenv can be wiped out after execution. > > >> > > > >> >Then we got what Max wants (atomicity of DagRuns) and what Dan wants > > >> >(the API could simply accept "Wheel file + the Dag id". We have the > > >> >isolation between tasks running on the same worker (based on > > >> >virtualenv) so that each process in the same worker can run a > > >> >different version of the same Dag. We have much less confusion for > > >the > > >> >UI. > > >> > > > >> >Extra bonus 1: we can expand it to package different dependencies in > > >> >the wheels as well - so that if an operator requires a different > > >> >(newer) version of a python library, it could be packaged together > > >> >with the DAG in the same .whl file. This is also a highly requested > > >> >feature. > > >> >Extra bonus 2: workers will stop depending on the DAG file mount (!) > > >> >which was our long term goal and indeed as Dan mentioned - a great > > >> >step towards multi-tenancy. > > >> > > > >> >J. > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> >On Fri, Jul 31, 2020 at 6:41 AM Maxime Beauchemin > > >> ><[email protected]> wrote: > > >> >> > > >> >> Having tried it early on, I'd advocate pretty strongly against > > >> >pickles and > > >> >> would rather not get too deep into the why here. Short story is > > >they > > >> >can > > >> >> pull the entire memory space or much more than you want, and it's > > >> >> impossible to reason about where they end. For that reason and > > >other > > >> >> reasons, they're a security issue. Oh and some objects are not > > >> >picklable > > >> >> (Jinja templates! to name a problematic one...). I've also seen > > >> >> secret-related classes that raise when pickled (thank god!). 
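
To make the wheel packaging idea above a little more concrete, here is a rough, purely illustrative sketch of packing a DAG file plus its helper modules into a .whl and installing that wheel into a throwaway virtualenv on a worker. None of this is an existing Airflow interface; the names, paths and layout are assumptions, and a real implementation would need more care (PEP 440 versions, caching, signing, Windows paths, and so on):

    # Illustrative only: pack a DAG and its helper modules into a wheel, then
    # install it into a short-lived virtualenv for a single task run.
    import pathlib
    import shutil
    import subprocess
    import sys
    import tempfile
    import textwrap
    import venv

    def build_dag_wheel(dag_file, extra_files, version, out_dir):
        """Build a wheel containing the DAG file and any extra python files.

        `version` must be PEP 440 compliant, e.g. "0.1+<dag-fingerprint>".
        """
        out_dir = pathlib.Path(out_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        project = pathlib.Path(tempfile.mkdtemp(prefix="dag-wheel-"))
        pkg = project / "dag_snapshot"
        pkg.mkdir()
        (pkg / "__init__.py").touch()
        for src in [dag_file, *extra_files]:
            shutil.copy(src, pkg / pathlib.Path(src).name)
        (project / "setup.py").write_text(textwrap.dedent(f"""
            from setuptools import setup, find_packages
            setup(name="dag-snapshot", version="{version}", packages=find_packages())
        """))
        subprocess.run([sys.executable, "-m", "pip", "wheel", "--no-deps",
                        "-w", str(out_dir), str(project)], check=True)
        return next(out_dir.glob("*.whl"))

    def install_into_fresh_venv(wheel_path):
        """Install the wheel into a throwaway virtualenv (wiped after the task)."""
        env_dir = pathlib.Path(tempfile.mkdtemp(prefix="task-venv-"))
        venv.create(env_dir, with_pip=True)
        pip = env_dir / "bin" / "pip"  # POSIX layout; Windows uses Scripts/
        subprocess.run([str(pip), "install", str(wheel_path)], check=True)
        return env_dir

In the proposal above, the scheduler would store the resulting .whl as a blob keyed by the DAG fingerprint, and a worker would install it into such a throwaway environment, run the single task, and wipe the environment afterwards.
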
> > >> >> > > >> >> About callback and other things like that, it's quite a puzzle in > > >> >python. > > >> >> One solution would be to point to a python namespace > > >> >> callback="preset.airflow_utils.slack_callback" and assume the > > >> >function has > > >> >> to exist in the remote interpreter. Personally I like the > > >DagFetcher > > >> >idea > > >> >> (it could be great to get a pointer to that mailing list thread > > >> >here), > > >> >> specifically the GitDagFetcher. I don't know how [un]reasonable it > > >> >is, but > > >> >> I hate pickles so much that shipping source code around seems much > > >> >more > > >> >> reasonable to me. I think out there there's a talk from Mike Star > > >> >about > > >> >> Dataswarm at FB and he may mention how their workers may git > > >shallow > > >> >clone > > >> >> the pipeline repo. Or maybe they use that "beautifully ugly" hack > > >to > > >> >use > > >> >> a gitfs fuse [file system in user space] on the worker [could get > > >> >deeper > > >> >> into that, not sure how reasonable that is either]. > > >> >> > > >> >> About fingerprints, a simple `start_date = datetime.now() - > > >> >timedelta(1)` > > >> >> may lead to a never-repeating fingerprint. From memory the spec > > >> >doesn't > > >> >> list out the properties considered to build the hash. It be > > >helpful > > >> >to > > >> >> specify and review that list. > > >> >> > > >> >> Max > > >> >> > > >> >> On Wed, Jul 29, 2020 at 5:20 AM Kaxil Naik <[email protected]> > > >> >wrote: > > >> >> > > >> >> > Thanks, both Max and Dan for your comments, please check my > > >reply > > >> >below: > > >> >> > > > >> >> > > > >> >> > > Personally I vote for a DAG version to be pinned and > > >consistent > > >> >for the > > >> >> > > duration of the DAG run. Some of the reasons why: > > >> >> > > - it's easier to reason about, and therefore visualize and > > >> >troubleshoot > > >> >> > > - it prevents some cases where dependencies are never met > > >> >> > > - it prevents the explosion of artifact/metadata (one > > >> >serialization per > > >> >> > > dagrun as opposed to one per scheduler cycle) in the case of a > > >> >dynamic > > >> >> > DAG > > >> >> > > whose fingerprint is never the same. > > >> >> > > > >> >> > > > >> >> > In this AIP, we were only looking to fix the current "Viewing > > >> >behaviour" > > >> >> > and > > >> >> > we were intentionally not changing the execution behaviour. > > >> >> > The change you are suggesting means we need to introduce DAG > > >> >Versioning for > > >> >> > the > > >> >> > workers too. This will need more work as can't use the > > >Serialised > > >> >> > Representation > > >> >> > to run the task since users could use custom modules in a > > >different > > >> >part of > > >> >> > code, > > >> >> > example the PythonOperator has python_callable that allows > > >running > > >> >any > > >> >> > arbitrary code. > > >> >> > A similar case is with the *on_*_callbacks* defined on DAG. > > >> >> > > > >> >> > Based on the current scope of the AIP, we still plan to use the > > >> >actual DAG > > >> >> > files for the > > >> >> > execution and not use Serialized DAGs for the workers. > > >> >> > > > >> >> > To account for all the custom modules we will have to start > > >looking > > >> >at > > >> >> > pickle (cloudpickle). > > >> >> > > > >> >> > I'm certain that there are lots of > > >> >> > > those DAGs out there, and that it will overwhelm the metadata > > >> >database, > > >> >> > and > > >> >> > > confuse the users. 
For an hourly DAG is would mean 24 artifact > > >> >per day > > >> >> > > instead of 1000+ > > >> >> > > > >> >> > > > >> >> > What kind of dynamic DAGs are we talking about here, I would > > >think > > >> >the DAG > > >> >> > signature won't change > > >> >> > but I might be wrong, can you give an example, please. > > >> >> > > > >> >> > If backwards compatibility in behavior is a concern, I'd > > >recommend > > >> >adding a > > >> >> > > flag to the DAG class and/or config and make sure we're doing > > >the > > >> >right > > >> >> > > thing by default. People who want backward compatibility would > > >> >have to > > >> >> > > change that default. But again, that's a lot of extra and > > >> >confusing > > >> >> > > complexity that will likely be the source of bugs and user > > >> >confusion. > > >> >> > > Having a clear, easy to reason about execution model is super > > >> >important. > > >> >> > > > >> >> > Think about visualizing a DAG that shapeshifted 5 times during > > >its > > >> >> > > execution, how does anyone make sense of that? > > >> >> > > > >> >> > > > >> >> > Wouldn't that be an edge case? How often would someone change > > >the > > >> >DAG > > >> >> > structure in the middle of > > >> >> > a DAG execution. And since if they do change, the Graph View > > >should > > >> >show > > >> >> > all the tasks that were > > >> >> > run, if it just shows based on the latest version, the behaviour > > >> >would be > > >> >> > the same as now. > > >> >> > > > >> >> > -------- > > >> >> > > > >> >> > Strongly agree with Max's points, also I feel the right way to > > >go > > >> >about > > >> >> > > this is instead of Airflow schedulers/webservers/workers > > >reading > > >> >DAG > > >> >> > Python > > >> >> > > files, they would instead read from serialized representations > > >of > > >> >the > > >> >> > DAGs > > >> >> > > (e.g. json representation in the Airflow DB). Instead of DAG > > >> >owners > > >> >> > pushing > > >> >> > > their DAG files to the Airflow components via varying > > >mechanisms > > >> >(e.g. > > >> >> > > git), they would instead call an Airflow CLI to push the > > >> >serialized DAG > > >> >> > > representations to the DB, and for things like dynamic DAGs > > >you > > >> >could > > >> >> > > populate them from a DAG or another service. > > >> >> > > > >> >> > > > >> >> > Airflow Webserver and the Scheduler will definitely read from > > >the > > >> >> > Serialized representation as > > >> >> > they don't need all the code from the DAG files. > > >> >> > > > >> >> > While the workers definitely need access to DAG files as the > > >> >> > tasks/operators would be using > > >> >> > code form custom modules and classes which are required to run > > >the > > >> >tasks. > > >> >> > > > >> >> > If we do want to go down that route we will have to use > > >something > > >> >like > > >> >> > cloudpickle that serializes > > >> >> > entire DAG file and their dependencies. And also ensure that > > >> >someone is not > > >> >> > able to change the pickled > > >> >> > source when sending from executor to the worker as that poses a > > >big > > >> >> > security risk. > > >> >> > > > >> >> > - Kaxil > > >> >> > > > >> >> > On Wed, Jul 29, 2020 at 12:43 PM Jacob Ward > > ><[email protected]> > > >> >wrote: > > >> >> > > > >> >> > > I came here to say what Max has said, only less eloquently. > > >> >> > > > > >> >> > > I do have one concern with locking the version for a single > > >run. 
> > >> >> > Currently > > >> >> > > it is possible for a user to create a dag which intentionally > > >> >changes as > > >> >> > a > > >> >> > > dag executes, i.e. dynamically creating a task for the dag > > >during > > >> >a run > > >> >> > by > > >> >> > > modifying external data, but this change would prevent that. > > >I'm > > >> >of the > > >> >> > > opinion that this situation is bad practice anyway so it > > >doesn't > > >> >matter > > >> >> > if > > >> >> > > we make it impossible to do, but others may disagree. > > >> >> > > > > >> >> > > On Tue, 28 Jul 2020 at 17:08, Dan Davydov > > >> ><[email protected]> > > >> >> > > wrote: > > >> >> > > > > >> >> > > > Strongly agree with Max's points, also I feel the right way > > >to > > >> >go about > > >> >> > > > this is instead of Airflow schedulers/webservers/workers > > >> >reading DAG > > >> >> > > Python > > >> >> > > > files, they would instead read from serialized > > >representations > > >> >of the > > >> >> > > DAGs > > >> >> > > > (e.g. json representation in the Airflow DB). Instead of DAG > > >> >owners > > >> >> > > pushing > > >> >> > > > their DAG files to the Airflow components via varying > > >> >mechanisms (e.g. > > >> >> > > > git), they would instead call an Airflow CLI to push the > > >> >serialized DAG > > >> >> > > > representations to the DB, and for things like dynamic DAGs > > >you > > >> >could > > >> >> > > > populate them from a DAG or another service. > > >> >> > > > > > >> >> > > > This would also enable other features like stronger > > >> >> > > security/multi-tenancy. > > >> >> > > > > > >> >> > > > On Tue, Jul 28, 2020 at 6:44 PM Maxime Beauchemin < > > >> >> > > > [email protected]> wrote: > > >> >> > > > > > >> >> > > > > > "mixed version" > > >> >> > > > > > > >> >> > > > > Personally I vote for a DAG version to be pinned and > > >> >consistent for > > >> >> > the > > >> >> > > > > duration of the DAG run. Some of the reasons why: > > >> >> > > > > - it's easier to reason about, and therefore visualize and > > >> >> > troubleshoot > > >> >> > > > > - it prevents some cases where dependencies are never met > > >> >> > > > > - it prevents the explosion of artifact/metadata (one > > >> >serialization > > >> >> > per > > >> >> > > > > dagrun as opposed to one per scheduler cycle) in the case > > >of > > >> >a > > >> >> > dynamic > > >> >> > > > DAG > > >> >> > > > > whose fingerprint is never the same. I'm certain that > > >there > > >> >are lots > > >> >> > of > > >> >> > > > > those DAGs out there, and that it will overwhelm the > > >metadata > > >> >> > database, > > >> >> > > > and > > >> >> > > > > confuse the users. For an hourly DAG is would mean 24 > > >> >artifact per > > >> >> > day > > >> >> > > > > instead of 1000+ > > >> >> > > > > > > >> >> > > > > If backwards compatibility in behavior is a concern, I'd > > >> >recommend > > >> >> > > > adding a > > >> >> > > > > flag to the DAG class and/or config and make sure we're > > >doing > > >> >the > > >> >> > right > > >> >> > > > > thing by default. People who want backward compatibility > > >> >would have > > >> >> > to > > >> >> > > > > change that default. But again, that's a lot of extra and > > >> >confusing > > >> >> > > > > complexity that will likely be the source of bugs and user > > >> >confusion. > > >> >> > > > > Having a clear, easy to reason about execution model is > > >super > > >> >> > > important. 
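
As a rough illustration of the "push serialized DAG representations" workflow Dan describes above, a client-side submit step could look something like the sketch below. SerializedDAG.to_dict is the helper Airflow already uses internally for DAG serialization, but the submission endpoint shown here is purely hypothetical; no such API exists today:

    # Sketch only: serialize a DAG locally and push the JSON to the metadata
    # service. The /dags/{dag_id}/versions endpoint is invented for this example.
    import json

    import requests
    from airflow.models import DagBag
    from airflow.serialization.serialized_objects import SerializedDAG

    def submit_dag(dags_folder, dag_id, api_base, token):
        dagbag = DagBag(dag_folder=dags_folder, include_examples=False)
        dag = dagbag.get_dag(dag_id)
        payload = SerializedDAG.to_dict(dag)  # plain, JSON-serializable dict
        resp = requests.post(
            f"{api_base}/dags/{dag_id}/versions",
            data=json.dumps(payload),
            headers={"Content-Type": "application/json",
                     "Authorization": f"Bearer {token}"},
        )
        resp.raise_for_status()

For dynamic DAGs, the same call could be made by whatever process generates them, which is how Dan suggests populating them from another service.
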
> > >> >> > > > > > > >> >> > > > > Think about visualizing a DAG that shapeshifted 5 times > > >> >during its > > >> >> > > > > execution, how does anyone make sense of that? > > >> >> > > > > > > >> >> > > > > Max > > >> >> > > > > > > >> >> > > > > On Tue, Jul 28, 2020 at 3:14 AM Kaxil Naik > > >> ><[email protected]> > > >> >> > > wrote: > > >> >> > > > > > > >> >> > > > > > Thanks Max for your comments. > > >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > *DAG Fingerprinting: *this can be tricky, especially in > > >> >regards to > > >> >> > > > > dynamic > > >> >> > > > > > > DAGs, where in some cases each parsing of the DAG can > > >> >result in a > > >> >> > > > > > different > > >> >> > > > > > > fingerprint. I think DAG and tasks attributes are left > > >> >out from > > >> >> > the > > >> >> > > > > > > proposal that should be considered as part of the > > >> >fingerprint, > > >> >> > like > > >> >> > > > > > trigger > > >> >> > > > > > > rules or task start/end datetime. We should do a full > > >> >pass of all > > >> >> > > DAG > > >> >> > > > > > > arguments and make sure we're not forgetting anything > > >> >that can > > >> >> > > change > > >> >> > > > > > > scheduling logic. Also, let's be careful that > > >something > > >> >as simple > > >> >> > > as > > >> >> > > > a > > >> >> > > > > > > dynamic start or end date on a task could lead to a > > >> >different > > >> >> > > version > > >> >> > > > > > each > > >> >> > > > > > > time you parse. > > >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > The short version of Dag Fingerprinting would be > > >> >> > > > > > just a hash of the Serialized DAG. > > >> >> > > > > > > > >> >> > > > > > *Example DAG*: https://imgur.com/TVuoN3p > > >> >> > > > > > *Example Serialized DAG*: https://imgur.com/LmA2Bpr > > >> >> > > > > > > > >> >> > > > > > It contains all the task & DAG parameters. When they > > >> >change, > > >> >> > > Scheduler > > >> >> > > > > > writes > > >> >> > > > > > a new version of Serialized DAGs to the DB. The > > >Webserver > > >> >then > > >> >> > reads > > >> >> > > > the > > >> >> > > > > > DAGs from the DB. > > >> >> > > > > > > > >> >> > > > > > I'd recommend limiting serialization/storage of one > > >version > > >> >> > > > > > > per DAG Run, as opposed to potentially everytime the > > >DAG > > >> >is > > >> >> > parsed > > >> >> > > - > > >> >> > > > > once > > >> >> > > > > > > the version for a DAG run is pinned, fingerprinting is > > >> >not > > >> >> > > > re-evaluated > > >> >> > > > > > > until the next DAG run is ready to get created. > > >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > This is to handle Scenario 3 where a DAG structure is > > >> >changed > > >> >> > > mid-way. > > >> >> > > > > > Since we don't intend to > > >> >> > > > > > change the execution behaviour, if we limit Storage of 1 > > >> >version > > >> >> > per > > >> >> > > > DAG, > > >> >> > > > > > it won't actually show what > > >> >> > > > > > was run. > > >> >> > > > > > > > >> >> > > > > > Example Dag v1: Task A -> Task B -> Task C > > >> >> > > > > > The worker has completed the execution of Task B and is > > >> >just about > > >> >> > to > > >> >> > > > > > complete the execution of Task B. > > >> >> > > > > > > > >> >> > > > > > The 2nd version of DAG is deployed: Task A -> Task D > > >> >> > > > > > Now Scheduler queued Task D and it will run to > > >completion. 
> > >> >(Task C > > >> >> > > > won't > > >> >> > > > > > run) > > >> >> > > > > > > > >> >> > > > > > In this case, "the actual representation of the DAG" > > >that > > >> >run is > > >> >> > > > neither > > >> >> > > > > v1 > > >> >> > > > > > nor v2 but a "mixed version" > > >> >> > > > > > (Task A -> Task B -> Task D). The plan is that the > > >> >Scheduler will > > >> >> > > > create > > >> >> > > > > > this "mixed version" based on what ran > > >> >> > > > > > and the Graph View would show this "mixed version". > > >> >> > > > > > > > >> >> > > > > > There would also be a toggle button on the Graph View to > > >> >select v1 > > >> >> > or > > >> >> > > > v2 > > >> >> > > > > > where the tasks will be highlighted to show > > >> >> > > > > > that a particular task was in v1 or v2 as shown in > > >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > > >> >> > > > > > >> >> > > > > >> >> > > > >> > > >> > > > https://cwiki.apache.org/confluence/download/attachments/158868919/Picture%201.png?version=2&modificationDate=1595612863000&api=v2 > > >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > *Visualizing change in the tree view:* I think this is > > >very > > >> >complex > > >> >> > > and > > >> >> > > > > > > many things can make this view impossible to render > > >(task > > >> >> > > dependency > > >> >> > > > > > > reversal, cycles across versions, ...). Maybe a better > > >> >visual > > >> >> > > > approach > > >> >> > > > > > > would be to render independent, individual tree views > > >for > > >> >each > > >> >> > DAG > > >> >> > > > > > version > > >> >> > > > > > > (side by side), and doing best effort aligning the > > >tasks > > >> >across > > >> >> > > > blocks > > >> >> > > > > > and > > >> >> > > > > > > "linking" tasks with lines across blocks when > > >necessary. > > >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > Agreed, the plan is to do the best effort aligning. > > >> >> > > > > > At this point in time, task additions to the end of the > > >DAG > > >> >are > > >> >> > > > expected > > >> >> > > > > to > > >> >> > > > > > be compatible, > > >> >> > > > > > but changes to task structure within the DAG may cause > > >the > > >> >tree > > >> >> > view > > >> >> > > > not > > >> >> > > > > to > > >> >> > > > > > incorporate “old” and “new” in the same view, hence that > > >> >won't be > > >> >> > > > shown. > > >> >> > > > > > > > >> >> > > > > > Regards, > > >> >> > > > > > Kaxil > > >> >> > > > > > > > >> >> > > > > > On Mon, Jul 27, 2020 at 6:02 PM Maxime Beauchemin < > > >> >> > > > > > [email protected]> wrote: > > >> >> > > > > > > > >> >> > > > > > > Some notes and ideas: > > >> >> > > > > > > > > >> >> > > > > > > *DAG Fingerprinting: *this can be tricky, especially > > >in > > >> >regards > > >> >> > to > > >> >> > > > > > dynamic > > >> >> > > > > > > DAGs, where in some cases each parsing of the DAG can > > >> >result in a > > >> >> > > > > > different > > >> >> > > > > > > fingerprint. I think DAG and tasks attributes are left > > >> >out from > > >> >> > the > > >> >> > > > > > > proposal that should be considered as part of the > > >> >fingerprint, > > >> >> > like > > >> >> > > > > > trigger > > >> >> > > > > > > rules or task start/end datetime. We should do a full > > >> >pass of all > > >> >> > > DAG > > >> >> > > > > > > arguments and make sure we're not forgetting anything > > >> >that can > > >> >> > > change > > >> >> > > > > > > scheduling logic. 
Also, let's be careful that > > >something > > >> >as simple > > >> >> > > as > > >> >> > > > a > > >> >> > > > > > > dynamic start or end date on a task could lead to a > > >> >different > > >> >> > > version > > >> >> > > > > > each > > >> >> > > > > > > time you parse. I'd recommend limiting > > >> >serialization/storage of > > >> >> > one > > >> >> > > > > > version > > >> >> > > > > > > per DAG Run, as opposed to potentially everytime the > > >DAG > > >> >is > > >> >> > parsed > > >> >> > > - > > >> >> > > > > once > > >> >> > > > > > > the version for a DAG run is pinned, fingerprinting is > > >> >not > > >> >> > > > re-evaluated > > >> >> > > > > > > until the next DAG run is ready to get created. > > >> >> > > > > > > > > >> >> > > > > > > *Visualizing change in the tree view:* I think this is > > >> >very > > >> >> > complex > > >> >> > > > and > > >> >> > > > > > > many things can make this view impossible to render > > >(task > > >> >> > > dependency > > >> >> > > > > > > reversal, cycles across versions, ...). Maybe a better > > >> >visual > > >> >> > > > approach > > >> >> > > > > > > would be to render independent, individual tree views > > >for > > >> >each > > >> >> > DAG > > >> >> > > > > > version > > >> >> > > > > > > (side by side), and doing best effort aligning the > > >tasks > > >> >across > > >> >> > > > blocks > > >> >> > > > > > and > > >> >> > > > > > > "linking" tasks with lines across blocks when > > >necessary. > > >> >> > > > > > > > > >> >> > > > > > > On Fri, Jul 24, 2020 at 12:46 PM Vikram Koka < > > >> >> > [email protected] > > >> >> > > > > > >> >> > > > > > wrote: > > >> >> > > > > > > > > >> >> > > > > > > > Team, > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > We just created 'AIP-36 DAG Versioning' on > > >Confluence > > >> >and would > > >> >> > > > very > > >> >> > > > > > much > > >> >> > > > > > > > appreciate feedback and suggestions from the > > >community. > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > >> >> > > > > > > > >> >> > > > > > > >> >> > > > > > >> >> > > > > >> >> > > > >> > > >> > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > The DAG Versioning concept has been discussed on > > >> >multiple > > >> >> > > occasions > > >> >> > > > > in > > >> >> > > > > > > the > > >> >> > > > > > > > past and has been a topic highlighted as part of > > >> >Airflow 2.0 as > > >> >> > > > well. > > >> >> > > > > > We > > >> >> > > > > > > at > > >> >> > > > > > > > Astronomer have heard data engineers at several > > >> >enterprises ask > > >> >> > > > about > > >> >> > > > > > > this > > >> >> > > > > > > > feature as well, for easier debugging when changes > > >are > > >> >made to > > >> >> > > DAGs > > >> >> > > > > as > > >> >> > > > > > a > > >> >> > > > > > > > result of evolving business needs. > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > As described in the AIP, we have a proposal focused > > >on > > >> >ensuring > > >> >> > > > that > > >> >> > > > > > the > > >> >> > > > > > > > visibility behaviour of Airflow is correct, without > > >> >changing > > >> >> > the > > >> >> > > > > > > execution > > >> >> > > > > > > > behaviour. 
We considered changing the execution > > >> >behaviour as > > >> >> > > well, > > >> >> > > > > but > > >> >> > > > > > > > decided that the risks in changing execution > > >behavior > > >> >were too > > >> >> > > high > > >> >> > > > > as > > >> >> > > > > > > > compared to the benefits and therefore decided to > > >limit > > >> >the > > >> >> > scope > > >> >> > > > to > > >> >> > > > > > only > > >> >> > > > > > > > making sure that the visibility was correct. > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > We would like to attempt this based on our > > >experience > > >> >running > > >> >> > > > Airflow > > >> >> > > > > > as > > >> >> > > > > > > a > > >> >> > > > > > > > service. We believe that this benefits Airflow as a > > >> >project and > > >> >> > > the > > >> >> > > > > > > > development experience of data engineers using > > >Airflow > > >> >across > > >> >> > the > > >> >> > > > > > world. > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > Any feedback, suggestions, and comments would be > > >> >greatly > > >> >> > > > > appreciated. > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > Best Regards, > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > Kaxil Naik, Ryan Hamilton, Ash Berlin-Taylor, and > > >> >Vikram Koka > > >> >> > > > > > > > > > >> >> > > > > > > > > >> >> > > > > > > > >> >> > > > > > > >> >> > > > > > >> >> > > > > >> >> > > > > >> >> > > -- > > >> >> > > > > >> >> > > Jacob Ward | Graduate Data Infrastructure Engineer > > >> >> > > > > >> >> > > [email protected] > > >> >> > > > > >> >> > > > > >> >> > > NEW YORK | BOSTON | BRIGHTON | LONDON | BERLIN | > > >> >STUTTGART | > > >> >> > > PARIS | SINGAPORE | SYDNEY > > >> >> > > > > >> >> > > > >> > > > >> > > > >> > > > >> >-- > > >> > > > >> >Jarek Potiuk > > >> >Polidea | Principal Software Engineer > > >> > > > >> >M: +48 660 796 129 > > > > > -- > > Jarek Potiuk > Polidea <https://www.polidea.com/> | Principal Software Engineer > > M: +48 660 796 129 <+48660796129> > [image: Polidea] <https://www.polidea.com/> >
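
Coming back to the fingerprinting discussion above (a hash of the serialized DAG), here is a small sketch of how such a fingerprint could be computed, and why something like start_date=datetime.now() defeats it, assuming the serialized form is a plain JSON-serializable dict:

    # Sketch: fingerprint a DAG as a hash of its canonical serialized form.
    # Sorted keys keep the hash stable across parses, but a volatile field such
    # as start_date=datetime.now() changes on every parse and therefore yields
    # a new fingerprint each time, which is exactly Max's concern.
    import hashlib
    import json

    def dag_fingerprint(serialized_dag):
        canonical = json.dumps(serialized_dag, sort_keys=True, default=str)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    v1 = {"dag_id": "example", "tasks": ["A", "B", "C"], "start_date": "2020-07-01"}
    v2 = {"dag_id": "example", "tasks": ["A", "D"], "start_date": "2020-07-01"}
    assert dag_fingerprint(v1) != dag_fingerprint(v2)

Limiting storage to one version per DAG run, as proposed, keeps the number of stored artifacts bounded even when the hash would otherwise never repeat.
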
