Thanks Xiaodong, my bad there. I've attached the file to this email and also uploaded it here <https://photos.app.goo.gl/Rr5BsHvxXEXnbY5K7> and here <https://imgur.com/ncqqQgc>.

Cheers,
Kevin Y

On Fri, Mar 8, 2019 at 3:42 AM Deng Xiaodong <xd.den...@gmail.com> wrote:

> Hi Kevin,
>
> The image you attached is not displayed properly. Could you upload it somewhere and provide a link instead?
>
> Thanks!
>
> XD
>
> On Fri, Mar 8, 2019 at 19:38 Kevin Yang <yrql...@gmail.com> wrote:
>
> > Hi all,
> >
> > When I was preparing some work related to this AIP I found something very concerning. I noticed this JIRA ticket <https://issues.apache.org/jira/browse/AIRFLOW-3562> is trying to remove the webserver's dependency on the dagbag, which is awesome: we badly wanted this but never got to start work on it. However, when I looked at some of its subtasks, which try to remove the dagbag dependency from each endpoint, I found the way we remove the dependency is not ideal. For example, this PR <https://github.com/apache/airflow/pull/4867/files> will require us to parse the DAG file each time we hit the endpoint.
> >
> > If we go down this path, we can indeed get rid of the dagbag dependency easily, but we will have to (1) increase the DB load (not too concerning at the moment) and (2) wait for the DAG file to be parsed before getting the page back, potentially multiple times. A DAG file can sometimes take quite a while to parse, e.g. we have some framework DAG files that generate a large number of DAGs from static config files or even Jupyter notebooks, and they can take 30+ seconds to parse. Yes, we don't like large DAG files, but people do see the beauty of code as config and sometimes heavily leverage it. Even assuming all users have nice small Python files that parse quickly, I'm still a bit worried about this approach. Continuing on this path means we've chosen DagModel to be the serialized representation of a DAG, with DB columns holding the different properties. It can be one candidate, but I don't know if we should settle on that now.
> > I would personally prefer a more compact (e.g. JSON5) and easy-to-scale representation (such that serializing new fields != DB upgrade).
> >
> > As I picture it, we would have to collect the list of dynamic features that depend on unserializable fields of a DAG and start a discussion/vote on dropping support for them (I'm working on this, but if anyone has already done so please take over), decide on the serialized representation of a DAG, and then replace the dagbag with it in the webserver. Per previous discussion and some offline discussions with Dan, one future of DAG serialization that I like would look similar to this:
> >
> > [image: airflow_new_arch.jpg]
> >
> > We can still discuss/vote on which approach we want to take, but I don't want the door to the above design to be shut right now, or we will have to spend a lot of effort switching paths later.
> >
> > Bas and Peter, I'm very sorry to extend the discussion, but I do think this is tightly related to the AIP and the PRs behind it. And my sincere apologies for bringing this up so late (I only pull the open PR list occasionally; if there's a way to subscribe to new PR events I'd love to know how).
> >
> > Cheers,
> > Kevin Y
> >
> > On Thu, Feb 28, 2019 at 1:36 PM Peter van t Hof <pjrvant...@gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> Just some comments on the points Bolke gave in relation to my PR.
> >>
> >> First, the main focus is making the webserver stateless.
> >>
> >> > 1) Make the webserver stateless: needs the graph of the *current* dag
> >>
> >> This is the main goal, but a lot more PRs will be coming for this once my current one is merged. Edges and the graph view are already covered in my PR.
> >>
> >> > 2) Version dags: for consistency mainly and not requiring parsing of the
> >> > dag on every loop
> >>
> >> In my PR the historical graphs will be stored for each DagRun.
> >> This means that you can see whether an older DagRun had the same graph structure, even if some tasks no longer exist in the current graph. This is especially useful for dynamic DAGs.
> >>
> >> > 3) Make the scheduler not require DAG files. This could be done if the
> >> > edges contain all information when to trigger the next task. We can then
> >> > have event driven dag parsing outside of the scheduler loop, ie. by the
> >> > cli. Storage can also be somewhere else (git, artifactory, filesystem,
> >> > whatever).
> >>
> >> The scheduler is almost untouched in this PR. The only addition is that these edges are saved to the database; the scheduling itself didn't change. The scheduler still depends on the DAG object.
> >>
> >> > 4) Fully serialise the dag so it becomes transferable to workers
> >>
> >> It's nice to see that people have a lot of ideas about this. But as Fokko already mentioned, this is out of scope for the issue we are trying to solve. I also have some ideas about this, but I'd like to limit this PR/AIP to the webserver.
> >>
> >> For now my PR solves 1 and 2, and the rest of the behaviour (like scheduling) is untouched.
> >>
> >> Gr,
> >> Peter
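
A minimal sketch of the "serializing new fields != DB upgrade" idea from Kevin's message above, under stated assumptions: the `serialized_dag` table and the `save_dag` / `load_dag` helpers are hypothetical names, not Airflow APIs; SQLite stands in for the metadata DB; and plain JSON stands in for JSON5, which is not in the Python standard library. It only illustrates how a blob-per-DAG layout could let a reader such as the webserver fetch the graph without parsing the DAG file.

```python
import json
import sqlite3

# Hypothetical table: one JSON blob per DAG instead of one DB column per property.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT NOT NULL)"
)


def save_dag(dag_id, tasks, edges, **extra_fields):
    """Store the DAG structure as JSON; extra fields need no schema change."""
    blob = {
        "dag_id": dag_id,
        "tasks": sorted(tasks),
        "edges": sorted(edges),
        **extra_fields,
    }
    conn.execute(
        "INSERT OR REPLACE INTO serialized_dag (dag_id, data) VALUES (?, ?)",
        (dag_id, json.dumps(blob)),
    )


def load_dag(dag_id):
    """What an endpoint would read: a DB lookup, with no DAG file parsing."""
    row = conn.execute(
        "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# Assumption: some process that already parsed the DAG file writes the blob once.
save_dag("example", tasks={"extract", "load"}, edges=[("extract", "load")])

# Later, a new property is serialized without any ALTER TABLE.
save_dag(
    "example",
    tasks={"extract", "load"},
    edges=[("extract", "load")],
    schedule="@daily",
)

dag = load_dag("example")
```

The second `save_dag` call adds a `schedule` field without touching the table schema, which is the property Kevin is asking for; whether that is worth giving up queryable per-property columns is the trade-off the thread is debating.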