Hi Kevin,

The image you attached is not displayed properly. Could you upload it somewhere and share a link instead?
Thanks!
XD

On Fri, Mar 8, 2019 at 19:38 Kevin Yang <yrql...@gmail.com> wrote:

> Hi all,
>
> When I was preparing some work related to this AIP I found something very concerning. I noticed this JIRA ticket <https://issues.apache.org/jira/browse/AIRFLOW-3562> is trying to remove the dependency on dagbag from the webserver, which is awesome: we badly wanted it but never got to start working on it. However, when I looked at some of its subtasks, which try to remove the dagbag dependency from each endpoint, I found the way we remove the dependency is not ideal. For example, this PR <https://github.com/apache/airflow/pull/4867/files> will require us to parse the DAG file each time we hit the endpoint.
>
> If we go down this path, we can indeed get rid of the dagbag dependency easily, but we will have to 1. increase the DB load (not too concerning at the moment), and 2. wait for the DAG file to be parsed before getting the page back, potentially multiple times. DAG files can sometimes take quite a while to parse, e.g. we have some framework DAG files generating a large number of DAGs from static config files or even Jupyter notebooks, and they can take 30+ seconds to parse. Yes, we don't like large DAG files, but people do see the beauty of code as config and sometimes heavily leverage it. Even assuming all users have the same nice small Python file that can be parsed fast, I'm still a bit worried about this approach. Continuing on this path means we've chosen DagModel to be the serialized representation of a DAG, with DB columns holding the different properties. It can be one candidate, but I don't know if we should settle on that now. I would personally prefer a more compact (e.g. JSON5) and easy-to-scale representation (such that serializing new fields != DB upgrade).
>
> In my imagination we would have to collect the list of dynamic features depending on unserializable fields of a DAG and start a discussion/vote on dropping support for them (I'm working on this, but if anyone has already done so please take over), decide on the serialized representation of a DAG, and then replace dagbag with it in the webserver. Per the previous discussion and some offline discussions with Dan, one future of DAG serialization that I like would look similar to this:
>
> [image: airflow_new_arch.jpg]
>
> We can still discuss/vote on which approach we want to take, but I don't want the door to the above design to be shut right now, or we will have to spend a lot of effort switching paths later.
>
> Bas and Peter, I'm very sorry to extend the discussion, but I do think this is tightly related to the AIP and the PRs behind it. And my sincere apologies for bringing this up so late (I only pull the open PR list occasionally; if there's a way to subscribe to new PR events I'd love to know how).
>
> Cheers,
> Kevin Y
>
> On Thu, Feb 28, 2019 at 1:36 PM Peter van t Hof <pjrvant...@gmail.com> wrote:
>
>> Hi all,
>>
>> Just some comments on the points Bolke gave in relation to my PR.
>>
>> First, the main focus is making the webserver stateless.
>>
>> > 1) Make the webserver stateless: needs the graph of the *current* dag
>>
>> This is the main goal, but for this a lot more PRs will be coming once my current one is merged. For the edges and graph view this is already covered in my PR.
>>
>> > 2) Version dags: for consistency mainly and not requiring parsing of the
>> > dag on every loop
>>
>> In my PR the historical graphs will be stored for each DagRun. This means that you can see whether an older DagRun had the same graph structure, even if some tasks do not exist anymore in the current graph. This is especially useful for dynamic DAGs.
>>
>> > 3) Make the scheduler not require DAG files.
>> > This could be done if the edges contain all the information about when to
>> > trigger the next task. We can then have event-driven dag parsing outside of
>> > the scheduler loop, i.e. by the cli. Storage can also be somewhere else
>> > (git, artifactory, filesystem, whatever).
>>
>> The scheduler is almost untouched in this PR. The only thing added is that these edges are saved to the database, but the scheduling itself didn't change. The scheduler still depends on the DAG object.
>>
>> > 4) Fully serialise the dag so it becomes transferable to workers
>>
>> It is nice to see that people have a lot of ideas about this. But as Fokko already mentioned, this is out of scope for the issue we are trying to solve. I also have some ideas about this, but I would like to limit this PR/AIP to the webserver.
>>
>> For now my PR solves 1 and 2, and the rest of the behaviour (like scheduling) is untouched.
>>
>> Gr,
>> Peter
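
As a rough illustration of the trade-off discussed in this thread (DB columns per property versus a compact serialized blob), here is a minimal sketch of a JSON representation that the webserver could read without parsing the DAG file. Everything in it is an assumption made for illustration: `SerializedDag`, `dumps_dag`, `loads_dag` and `downstream_of` are hypothetical names, not Airflow APIs, and plain JSON stands in for the JSON5 mentioned above. The point is only that a new property can go into `extra` without a schema change, and that graph edges can be answered from the blob alone.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SerializedDag:
    """Hypothetical webserver-facing snapshot of a DAG (not an Airflow class)."""
    dag_id: str
    task_ids: list
    edges: list  # [upstream_task_id, downstream_task_id] pairs
    # Assumption: new properties are added here, so no DB migration is needed.
    extra: dict = field(default_factory=dict)


def dumps_dag(dag: SerializedDag) -> str:
    """Serialize the snapshot to one JSON string (e.g. a single text column)."""
    return json.dumps(asdict(dag), sort_keys=True)


def loads_dag(blob: str) -> SerializedDag:
    """Rebuild the snapshot from JSON without importing or parsing a DAG file."""
    return SerializedDag(**json.loads(blob))


def downstream_of(dag: SerializedDag, task_id: str) -> list:
    """Return the direct downstream task ids of task_id, sorted by name."""
    return sorted(dst for src, dst in dag.edges if src == task_id)


if __name__ == "__main__":
    dag = SerializedDag(
        dag_id="example",
        task_ids=["extract", "transform", "load"],
        edges=[["extract", "transform"], ["transform", "load"]],
        extra={"owner": "data-eng"},
    )
    restored = loads_dag(dumps_dag(dag))
    print(downstream_of(restored, "extract"))
```

Storing one such blob per DagRun would also be one way to keep the historical graph structure Peter describes, though that is a design choice the thread leaves open.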