I think Airflow Summit and some 2.3.0 teething problems had (un)successfully :) dragged most of the committers away from the few AIPs, but I believe there will shortly be a real "reinvigoration" of some of that work (speaking for myself, though :)).
On Fri, May 27, 2022 at 3:28 AM Max Payton <[email protected]> wrote:

Hey, I was wondering if the resurrected AIP was ever published? This is something that we (Lyft) are very interested in, and would like to contribute to as well.

*Max Payton*
He/Him/His
Software Engineer
202.441.7757
Lyft <http://www.lyft.com/>

On Tue, Feb 15, 2022 at 4:23 AM Jarek Potiuk <[email protected]> wrote:

Woohoo! Looking forward to it!

On Tue, Feb 15, 2022 at 1:11 PM Kaxil Naik <[email protected]> wrote:

Hey folks,

Just reviving this old thread to provide an update: we (Astronomer) will be resurrecting AIP-36 DAG Versioning, with a different scope, in the coming days - one more consistent with what has been discussed in this thread.

Regards,
Kaxil

On Thu, Aug 13, 2020 at 9:32 PM Jarek Potiuk <[email protected]> wrote:

I fully agree with the "user" not having to know any of the "wheel" details - similarly as they do not have to know the python interpreter or underlying libc library details. This should all be hidden from the users.

I think the wheels API that we might have there does not have to be user-facing. We could - rather easily - make a client that points to a DAG file, builds the appropriate wheel package under the hood, and submits it. I really doubt any users will use the API directly to submit DAGs - they will use clients built on top of it.

I think we should separate the user side from the implementation - similarly as we do not expect users to know any details of how the "DAG Fetcher" should work. In any case, with the DAG fetcher we need to define how it will ensure "atomicity" anyway - how do you make sure you get a "consistent" version of all the dependent python files when you fetch them? This is the part of the DAG fetcher that I do not like, because it assumes that "someone else" maintains the consistency and provides the "consistent view" somewhere on the "DAG server" side (whatever that server side is).

There were many ideas about some kind of manifest describing the files etc., but I think all of that depends on some ability to provide a "snapshot" of files that is a consistent set to execute. With "DAG Fetcher" this is something the "DAG fetching server" has to provide. It's super easy if that "server" is Git - we already use it for Git sync. But it's rather difficult to provide a good abstraction for a "generic" DAG fetcher.

IMHO it is far easier to provide such a consistent set at "submission time". In pretty much all cases, the user submitting the job already has the consistent set of python files that the DAG uses - that is pretty much a given. I think the job of the "submission" mechanism is to make a "snapshot" out of that consistent set and submit the snapshot, rather than individual files. Git provides this out of the box, but if we want to be generic I see no other way than to build such a "snapshot" locally. And wheels seem like a very good candidate - as long as they stay an implementation detail, hidden from the users.

J.
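(A minimal sketch of what such a submission client could do under the hood - hypothetical throughout, using only the standard library: it packs a single DAG file into a PEP 427 wheel that a submission endpoint could then accept.)

```python
import base64
import hashlib
import zipfile
from pathlib import Path

def _record_line(name: str, data: bytes) -> str:
    # RECORD format: path,sha256=<urlsafe-b64-no-padding>,<size>
    digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest()).rstrip(b"=")
    return f"{name},sha256={digest.decode()},{len(data)}"

def build_dag_wheel(dag_file: str, version: str, out_dir: str = ".") -> Path:
    """Package a single DAG file as a PEP 427 wheel (purelib, py3-none-any)."""
    dag_path = Path(dag_file)
    name = dag_path.stem  # e.g. "my_dag"
    dist_info = f"{name}-{version}.dist-info"
    wheel_path = Path(out_dir) / f"{name}-{version}-py3-none-any.whl"

    files = {
        dag_path.name: dag_path.read_bytes(),
        f"{dist_info}/METADATA": (
            f"Metadata-Version: 2.1\nName: {name}\nVersion: {version}\n".encode()
        ),
        f"{dist_info}/WHEEL": (
            b"Wheel-Version: 1.0\nGenerator: dag-submit-sketch\n"
            b"Root-Is-Purelib: true\nTag: py3-none-any\n"
        ),
    }
    record = [_record_line(n, data) for n, data in files.items()]
    record.append(f"{dist_info}/RECORD,,")  # RECORD lists itself without a hash

    with zipfile.ZipFile(wheel_path, "w", zipfile.ZIP_DEFLATED) as whl:
        for n, data in files.items():
            whl.writestr(n, data)
        whl.writestr(f"{dist_info}/RECORD", "\n".join(record) + "\n")
    return wheel_path

# e.g. build_dag_wheel("dags/my_dag.py", version="1.0") -> my_dag-1.0-py3-none-any.whl
```

The "additional files" idea discussed further down the thread would just mean adding more entries to `files` before the archive is written.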
On Tue, Aug 11, 2020 at 8:33 PM Ash Berlin-Taylor <[email protected]> wrote:

Anything to do with the process of building wheels should be a "power user"-only feature, and should not be required for most users - many, many users of Airflow are not primarily Python developers but data scientists, and needing them to understand anything about the Python build toolchain is too steep a learning curve for the benefit.

After all, it is very rare that people hit the multiple-concurrent-versions-of-a-DAG case.

-ash

On 10 August 2020 17:37:32 BST, Tomasz Urbaszek <[email protected]> wrote:

I like the idea of wheels, as this is probably the "most pythonic" solution. And a "DAG version" is defined not only by the DAG code but also by all the dependencies the DAG uses (custom functions, libraries, etc.), and it seems that wheels can address that.

However, I second Ash - keeping wheels in the db doesn't sound good. In my opinion, the DAG fetcher is the right solution, and the idea surfaces every time we talk about serialization. This abstraction has a lot of pros, as it allows a lot of customization (wheels, local fs, remote fs, etc.).

Apart from that, if we decide to use wheels, we should provide a CLI command to ease the process of building them. Also, I'm wondering about developers' workflow: moving between the code of different DAG versions sounds easy if you use git, but... what if someone doesn't use it?

Tomek

On Sat, Aug 8, 2020 at 9:49 AM Ash Berlin-Taylor <[email protected]> wrote:

Quick comment (as I'm still mostly on paternity leave):

Storing wheels in the db sounds like a bad idea to me, especially if we need to store deps in there too (and if we don't store deps, then they are incomplete) - they could get very large, and I've stored blobs of ~10 MB in Postgres before: I don't recommend it. It "works", but operating it is tricky.

> the API could simply accept "Wheel file + the Dag id"

This sounds like a huge security risk.

My main concern with this idea is that it seems a lot of complexity we are putting on users. Doubly so if they are already using Docker, where there already exists an ideal packaging and distribution format that could contain the DAG + needed code.

(Sorry for the brevity)

-ash

On 2 August 2020 08:47:39 BST, Jarek Potiuk <[email protected]> wrote:

A few points from my side (and a proposal!):

1) Agree with Max - with a rather strong NO for pickles (however, cloudpickle does indeed solve some of the problems). Pickles came up in a discussion at Polidea recently and the overall message was "no". I agree with Max here - if we can ship python code, turning it into a pickle for transit makes little sense to me and brings a plethora of problems.
2) I think the versioning solution should indeed treat the "DagRun" structure atomically. While I see why we would like to go with the UI/Scheduler only first rather than also implementing this in the workers, adding the "mixed version" is where it breaks down IMHO. Reasoning about such a "mixed version" dag is next to impossible. The current behavior is not well defined and non-deterministic (it depends on scheduler delays, syncing, type of deployment, restarts of the workers, etc.), and we would be moving the problem up to the UI (thus to the users) rather than solving it. So I am not a big fan of this and would rather solve it "well", with atomicity.

3) I see Dan's point as well - we have had many discussions where the idea of "submitting" the DAG for execution via the API came up - and it makes sense IMHO.

Proposal: implement full versioning, with code shipped via wheel BLOBs in the DB (akin to serialized DAGs).

I understand that the big issue is how to actually "ship" the code to the worker. And - maybe a wild idea - we can kill several birds with the same stone.

There were plenty of discussions on how we could do that, but one option was never truly explored - using wheel packages.

For those who do not know them, there is the PEP: https://www.python.org/dev/peps/pep-0427/

Wheels allow "packaging" python code in a standard way. They are portable ("purelib" + they contain .py rather than .pyc code), they carry metadata and versioning information, they can be signed for security, and they can contain other packages or python code. Why don't we let the scheduler pack the fingerprinted version of the DAG into a .whl and store it as a blob in the DB, next to the serialized form?

There were concerns about the size of the code kept in the DB - but we already use the DB for serialized DAGs and it works fine (I believe we only need to add compression of the JSON serialized form, as we learned from Airbnb during their talk at the Airflow Summit - wheels are already compressed). Also, each task will only need one particular "version" of one DAG, so even if we keep many of them in the DB, old versions will soon go "cold" and never be retrieved (and most DBs handle that well with caching/indexes).

And if we want to add "callables" from other files - there is nothing to stop the person who defines the dag from listing the files that should be packaged together with the main DAG file (additional_python_files = ["common/my_fantastic_library.py"] in the DAG constructor). Or we could auto-add files after the DAG gets imported (i.e. automatically package all files that are imported for that particular DAG from the "dags" folder). That should be rather easy.

This way we could ship the code to workers for the exact version that the DagRun uses. The wheels can be cached and unpacked/installed into a virtualenv for the execution of that single task - that should be super quick - and such a virtualenv can be wiped out after execution.

Then we get what Max wants (atomicity of DagRuns) and what Dan wants (the API could simply accept "wheel file + DAG id"). We get isolation between tasks running on the same worker (based on virtualenvs), so that each process on the same worker can run a different version of the same dag. And we get much less confusion in the UI.

Extra bonus 1: we can expand this to package different dependencies in the wheels as well - so that if an operator requires a different (newer) version of a python library, it can be packaged together with the DAG in the same .whl file. This is also a highly requested feature.
Extra bonus 2: workers will stop depending on the DAG file mount (!), which has been our long-term goal and, as Dan mentioned, a great step towards multi-tenancy.

J.
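(A sketch of that install-and-wipe flow on the worker, assuming the wheel blob has already been fetched from the DB to a local path. The venv and pip invocations are standard; `airflow_task_runner` is a hypothetical entrypoint, not a real module.)

```python
import shutil
import subprocess
import tempfile
import venv
from pathlib import Path

def run_task_in_throwaway_env(wheel_path: str, dag_id: str, task_id: str) -> None:
    """Install the DAG-version wheel into a fresh virtualenv, run one task, wipe it."""
    env_dir = Path(tempfile.mkdtemp(prefix=f"{dag_id}-"))
    try:
        venv.EnvBuilder(with_pip=True).create(env_dir)
        pip = env_dir / "bin" / "pip"       # "Scripts" on Windows
        python = env_dir / "bin" / "python"
        subprocess.run([str(pip), "install", wheel_path], check=True)
        # Hypothetical entrypoint: however the worker actually invokes a task.
        subprocess.run(
            [str(python), "-m", "airflow_task_runner", dag_id, task_id],
            check=True,
        )
    finally:
        shutil.rmtree(env_dir, ignore_errors=True)  # wipe after execution
```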
On Fri, Jul 31, 2020 at 6:41 AM Maxime Beauchemin <[email protected]> wrote:

Having tried it early on, I'd advocate pretty strongly against pickles, and would rather not get too deep into the why here. The short story is that they can pull in the entire memory space, or much more than you want, and it's impossible to reason about where they end. For that reason and others, they're a security issue. Oh, and some objects are not picklable (Jinja templates, to name a problematic one...). I've also seen secret-related classes that raise when pickled (thank god!).

About callbacks and other things like that, it's quite a puzzle in python. One solution would be to point to a python namespace - callback="preset.airflow_utils.slack_callback" - and assume the function has to exist in the remote interpreter. Personally I like the DagFetcher idea (it would be great to get a pointer to that mailing list thread here), specifically the GitDagFetcher. I don't know how [un]reasonable it is, but I hate pickles so much that shipping source code around seems much more reasonable to me. I think there's a talk out there from Mike Star about Dataswarm at FB where he may mention how their workers git-shallow-clone the pipeline repo. Or maybe they use that "beautifully ugly" hack of a gitfs fuse [file system in user space] on the worker [could get deeper into that; not sure how reasonable that is either].
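(A sketch of the namespace-path approach Max mentions above: resolving the dotted string to a callable with importlib at run time, on the assumption that the module is importable in the remote interpreter. The example path is the one from his message.)

```python
import importlib

def resolve_callback(dotted_path: str):
    """Resolve e.g. "preset.airflow_utils.slack_callback" to the actual function.

    The module must already be importable in the remote interpreter -
    that is the whole premise of the approach.
    """
    module_name, _, attr = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# callback = resolve_callback("preset.airflow_utils.slack_callback")
# callback(context)  # invoked by the worker on task success/failure
```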
About fingerprints: a simple `start_date = datetime.now() - timedelta(1)` may lead to a never-repeating fingerprint. From memory, the spec doesn't list out the properties considered to build the hash; it'd be helpful to specify and review that list.

Max
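(To make that concrete - a sketch of fingerprinting as a hash over the serialized DAG, the scheme Kaxil describes later in the thread; the dict here is a stand-in for Airflow's real serialized format.)

```python
import hashlib
import json
from datetime import datetime, timedelta

def fingerprint(serialized_dag: dict) -> str:
    """Hash of the serialized DAG; any attribute change yields a new version."""
    canonical = json.dumps(serialized_dag, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()

# A dynamic start_date is re-evaluated on every parse...
v1 = fingerprint({"dag_id": "demo", "start_date": datetime.now() - timedelta(1)})
v2 = fingerprint({"dag_id": "demo", "start_date": datetime.now() - timedelta(1)})
assert v1 != v2  # ...so the "version" never repeats
```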
On Wed, Jul 29, 2020 at 5:20 AM Kaxil Naik <[email protected]> wrote:

Thanks, both Max and Dan, for your comments - please check my replies below:

> Personally I vote for a DAG version to be pinned and consistent for the duration of the DAG run. Some of the reasons why:
> - it's easier to reason about, and therefore visualize and troubleshoot
> - it prevents some cases where dependencies are never met
> - it prevents the explosion of artifacts/metadata (one serialization per dagrun as opposed to one per scheduler cycle) in the case of a dynamic DAG whose fingerprint is never the same.

In this AIP, we were only looking to fix the current "viewing behaviour", and we were intentionally not changing the execution behaviour. The change you are suggesting means we would need to introduce DAG versioning for the workers too. That needs more work, as we can't use the serialised representation to run the task: users can use custom modules in other parts of the code - for example, the PythonOperator has a python_callable that allows running arbitrary code. A similar case is the *on_*_callbacks* defined on the DAG.

Based on the current scope of the AIP, we still plan to use the actual DAG files for execution, and not Serialized DAGs, on the workers.

To account for all the custom modules, we would have to start looking at pickle (cloudpickle).

> I'm certain that there are lots of those DAGs out there, and that it will overwhelm the metadata database, and confuse the users. For an hourly DAG it would mean 24 artifacts per day instead of 1,000+.

What kind of dynamic DAGs are we talking about here? I would think the DAG signature won't change, but I might be wrong - can you give an example, please?

> If backwards compatibility in behavior is a concern, I'd recommend adding a flag to the DAG class and/or config and make sure we're doing the right thing by default. People who want backward compatibility would have to change that default. But again, that's a lot of extra and confusing complexity that will likely be the source of bugs and user confusion. Having a clear, easy-to-reason-about execution model is super important.
> Think about visualizing a DAG that shapeshifted 5 times during its execution, how does anyone make sense of that?

Wouldn't that be an edge case? How often would someone change the DAG structure in the middle of a DAG execution? And if they do change it, the Graph View should show all the tasks that were run; if it just showed the latest version, the behaviour would be the same as now.

--------

> Strongly agree with Max's points; also I feel the right way to go about this is: instead of Airflow schedulers/webservers/workers reading DAG Python files, they would read from serialized representations of the DAGs (e.g. a json representation in the Airflow DB). Instead of DAG owners pushing their DAG files to the Airflow components via varying mechanisms (e.g. git), they would call an Airflow CLI to push the serialized DAG representations to the DB, and for things like dynamic DAGs you could populate them from a DAG or another service.

The Airflow webserver and scheduler will definitely read from the serialized representation, as they don't need all the code from the DAG files.

The workers, though, definitely need access to the DAG files, as the tasks/operators use code from custom modules and classes which are required to run the tasks.

If we do want to go down that route, we will have to use something like cloudpickle, which serializes the entire DAG file and its dependencies. And we would also have to ensure that no one can change the pickled source when it is sent from the executor to the worker, as that poses a big security risk.

- Kaxil
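(For reference, the cloudpickle route would look roughly like this; cloudpickle is a third-party package, and the closure here stands in for a PythonOperator's python_callable.)

```python
import pickle
import cloudpickle  # pip install cloudpickle

def make_callable(greeting: str):
    # A closure like those typically passed as python_callable.
    def task_fn(name: str) -> str:
        return f"{greeting}, {name}!"
    return task_fn

fn = make_callable("Hello")
blob = cloudpickle.dumps(fn)   # works: captures the closure by value
restored = pickle.loads(blob)  # plain pickle can load cloudpickle's output
assert restored("Airflow") == "Hello, Airflow!"
```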
On Wed, Jul 29, 2020 at 12:43 PM Jacob Ward <[email protected]> wrote:

I came here to say what Max has said, only less eloquently.

I do have one concern with locking the version for a single run. Currently it is possible for a user to create a dag which intentionally changes as it executes - i.e. dynamically creating a task for the dag during a run by modifying external data - and this change would prevent that. I'm of the opinion that this situation is bad practice anyway, so it doesn't matter if we make it impossible, but others may disagree.

On Tue, 28 Jul 2020 at 17:08, Dan Davydov <[email protected]> wrote:

Strongly agree with Max's points. Also, I feel the right way to go about this is: instead of Airflow schedulers/webservers/workers reading DAG Python files, they would read from serialized representations of the DAGs (e.g. a json representation in the Airflow DB). Instead of DAG owners pushing their DAG files to the Airflow components via varying mechanisms (e.g. git), they would call an Airflow CLI to push the serialized DAG representations to the DB, and for things like dynamic DAGs you could populate them from a DAG or another service.

This would also enable other features like stronger security/multi-tenancy.

On Tue, Jul 28, 2020 at 6:44 PM Maxime Beauchemin <[email protected]> wrote:

> "mixed version"

Personally I vote for a DAG version to be pinned and consistent for the duration of the DAG run. Some of the reasons why:
- it's easier to reason about, and therefore visualize and troubleshoot
- it prevents some cases where dependencies are never met
- it prevents the explosion of artifacts/metadata (one serialization per dagrun as opposed to one per scheduler cycle) in the case of a dynamic DAG whose fingerprint is never the same. I'm certain that there are lots of those DAGs out there, and that it would overwhelm the metadata database and confuse the users. For an hourly DAG it would mean 24 artifacts per day instead of 1,000+.

If backwards compatibility in behavior is a concern, I'd recommend adding a flag to the DAG class and/or config and making sure we're doing the right thing by default. People who want backward compatibility would have to change that default. But again, that's a lot of extra and confusing complexity that will likely be the source of bugs and user confusion. Having a clear, easy-to-reason-about execution model is super important.
Think about visualizing a DAG that shapeshifted 5 times during its execution - how does anyone make sense of that?

Max

On Tue, Jul 28, 2020 at 3:14 AM Kaxil Naik <[email protected]> wrote:

Thanks Max for your comments.

> *DAG Fingerprinting:* this can be tricky, especially in regards to dynamic DAGs, where in some cases each parsing of the DAG can result in a different fingerprint. I think DAG and task attributes are left out of the proposal that should be considered as part of the fingerprint, like trigger rules or task start/end datetimes. We should do a full pass of all DAG arguments and make sure we're not forgetting anything that can change scheduling logic. Also, let's be careful that something as simple as a dynamic start or end date on a task could lead to a different version each time you parse.

The short version of DAG fingerprinting would be just a hash of the Serialized DAG.

*Example DAG*: https://imgur.com/TVuoN3p
*Example Serialized DAG*: https://imgur.com/LmA2Bpr

It contains all the task & DAG parameters. When they change, the Scheduler writes a new version of the Serialized DAG to the DB. The Webserver then reads the DAGs from the DB.
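(A sketch of that write path: store a new serialized version only when the fingerprint changes. sqlite stands in for the metadata DB, and the schema is invented for illustration.)

```python
import hashlib
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Airflow metadata DB
conn.execute(
    "CREATE TABLE serialized_dag_version "
    "(dag_id TEXT, fingerprint TEXT, data TEXT, UNIQUE(dag_id, fingerprint))"
)

def store_if_changed(dag_id: str, serialized: dict) -> bool:
    """Write a new version row only when the hash of the serialized DAG changes."""
    fp = hashlib.sha256(json.dumps(serialized, sort_keys=True).encode()).hexdigest()
    row = conn.execute(
        "SELECT 1 FROM serialized_dag_version WHERE dag_id=? AND fingerprint=?",
        (dag_id, fp),
    ).fetchone()
    if row:
        return False  # same version already stored
    conn.execute(
        "INSERT INTO serialized_dag_version VALUES (?, ?, ?)",
        (dag_id, fp, json.dumps(serialized, sort_keys=True)),
    )
    return True

store_if_changed("demo", {"dag_id": "demo", "tasks": ["A", "B", "C"]})  # True
store_if_changed("demo", {"dag_id": "demo", "tasks": ["A", "B", "C"]})  # False
store_if_changed("demo", {"dag_id": "demo", "tasks": ["A", "D"]})       # True
```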
> I'd recommend limiting serialization/storage to one version per DAG Run, as opposed to potentially every time the DAG is parsed - once the version for a DAG run is pinned, fingerprinting is not re-evaluated until the next DAG run is ready to be created.

This is to handle Scenario 3, where a DAG's structure is changed mid-way. Since we don't intend to change the execution behaviour, if we limit storage to one version per DAG run, it won't actually show what was run.

Example DAG v1: Task A -> Task B -> Task C
The worker has completed the execution of Task B and is just about to start the execution of Task C.

The 2nd version of the DAG is deployed: Task A -> Task D
Now the Scheduler queues Task D and it will run to completion (Task C won't run).

In this case, "the actual representation of the DAG" that ran is neither v1 nor v2 but a "mixed version" (Task A -> Task B -> Task D). The plan is that the Scheduler will create this "mixed version" based on what ran, and the Graph View will show this "mixed version".

There would also be a toggle button on the Graph View to select v1 or v2, where the tasks are highlighted to show that a particular task was in v1 or v2, as shown in
https://cwiki.apache.org/confluence/download/attachments/158868919/Picture%201.png?version=2&modificationDate=1595612863000&api=v2

> *Visualizing change in the tree view:* I think this is very complex and many things can make this view impossible to render (task dependency reversal, cycles across versions, ...). Maybe a better visual approach would be to render independent, individual tree views for each DAG version (side by side), doing best-effort alignment of the tasks across blocks and "linking" tasks with lines across blocks when necessary.

Agreed - the plan is to do best-effort aligning. At this point in time, task additions to the end of the DAG are expected to be compatible, but changes to the task structure within the DAG may mean the tree view cannot incorporate "old" and "new" in the same view, hence that won't be shown.
Regards,
Kaxil

On Mon, Jul 27, 2020 at 6:02 PM Maxime Beauchemin <[email protected]> wrote:

Some notes and ideas:

*DAG Fingerprinting:* this can be tricky, especially in regards to dynamic DAGs, where in some cases each parsing of the DAG can result in a different fingerprint. I think DAG and task attributes are left out of the proposal that should be considered as part of the fingerprint, like trigger rules or task start/end datetimes. We should do a full pass of all DAG arguments and make sure we're not forgetting anything that can change scheduling logic. Also, let's be careful that something as simple as a dynamic start or end date on a task could lead to a different version each time you parse. I'd recommend limiting serialization/storage to one version per DAG Run, as opposed to potentially every time the DAG is parsed - once the version for a DAG run is pinned, fingerprinting is not re-evaluated until the next DAG run is ready to be created.

*Visualizing change in the tree view:* I think this is very complex and many things can make this view impossible to render (task dependency reversal, cycles across versions, ...). Maybe a better visual approach would be to render independent, individual tree views for each DAG version (side by side), doing best-effort alignment of the tasks across blocks and "linking" tasks with lines across blocks when necessary.

On Fri, Jul 24, 2020 at 12:46 PM Vikram Koka <[email protected]> wrote:

Team,

We just created 'AIP-36 DAG Versioning' on Confluence and would very much appreciate feedback and suggestions from the community.
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-36+DAG+Versioning

The DAG Versioning concept has been discussed on multiple occasions in the past and has been highlighted as a topic for Airflow 2.0 as well. We at Astronomer have heard data engineers at several enterprises ask about this feature too, for easier debugging when changes are made to DAGs as a result of evolving business needs.

As described in the AIP, our proposal focuses on ensuring that the visibility behaviour of Airflow is correct, without changing the execution behaviour. We considered changing the execution behaviour as well, but decided that the risks of changing execution behavior were too high compared to the benefits, and therefore limited the scope to making sure that the visibility is correct.

We would like to attempt this based on our experience running Airflow as a service. We believe this benefits Airflow as a project and the development experience of data engineers using Airflow across the world.

Any feedback, suggestions, and comments would be greatly appreciated.
Best Regards,

Kaxil Naik, Ryan Hamilton, Ash Berlin-Taylor, and Vikram Koka
